# This code shows some simple data cleaning procedures

Note that, in order to make the code simple, many details (sometimes quite important) are not considered.

The input file used as an example is the one generated in the TwitterFriends example.


## Main items introduced
1. Simple data cleaning strategies
2. Reading and writing dataframes from CSV files
  * pd.read_csv
  * df.to_csv
3. Exploring dataframes
  * df - shows beginning and end of table
  * df.head() - shows the first five rows
  * df\[&lt;column_name&gt;\] - shows the column &lt;column_name&gt;
  * df\[&lt;column_name&gt;\]\[&lt;row number&gt;\] - shows the content of the cell
  * df.dtypes - shows the type of each column
  * df\[&lt;column_name&gt;\].unique() - shows unique values of the items in the column &lt;column_name&gt;
  * df\[df.&lt;column_name&gt; == '&lt;value&gt;'\] - returns all rows that have the value &lt;value&gt; in the column &lt;column_name&gt;. The condition in the square brackets may be more complex; see, for example, cell 21
4. Manipulating dataframes
  * df.copy() - creates a copy of the dataframe
  * df.dropna(PARAM) - drops rows or columns that contain NaN (no value assigned)
  * df.drop(PARAM) - drops rows or columns
  * df.replace(PARAM) - replaces the content of cells
  * df\['&lt;column_name&gt;'\]\[&lt;row number&gt;\] = '&lt;value&gt;' - assigns the value &lt;value&gt; to the cell (the form df.loc\[&lt;row label&gt;, '&lt;column_name&gt;'\] = '&lt;value&gt;' is more robust in recent pandas versions)
5. Simple regular expressions

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read the tab-separated file generated with the Twitter API into a data frame and show the first five records
df = pd.read_csv('dataFromTwitter.txt', sep='\t')
df.head()

Unnamed: 0,screen_name,name,followers_count,location,status
0,TuckerCarlson,Tucker Carlson,1064787.0,"Washington, DC",Thanks for joining! https://t.co/YNNjGghUxR
1,jessebwatters,Jesse Watters,410410.0,,RT @FNC_Ladies_Rule: 🍾🎉Fox News is No. 1 Basic...
2,https://t.co/cbHoqt3QTG https://t.co/ODX77…,,,,
3,WhiteHouse,The White House,15327283.0,"Washington, D.C.","Lt. Gen. (Ret) Keith Kellogg: ""Trump as Comman..."
4,Scavino45,Dan Scavino Jr.,148667.0,@Twitter @DanScavino,'It is time to heal the wounds that have divid...
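
Row 2 above holds only a fragment of text in its first column and NaN everywhere else. One plausible cause (an assumption, not something the input file confirms) is a status that contained a line break, so that its tail ended up on a line of its own. A minimal sketch with hypothetical in-memory data shows how `pd.read_csv` fills such short lines with NaN:

```python
import io
import pandas as pd

# Hypothetical tab-separated text; the third line stands for the tail of a
# status that contained a line break, so it has only one field.
raw = (
    "screen_name\tname\tfollowers_count\n"
    "alice\tAlice A.\t100\n"
    "https://example.org/x\n"
    "bob\tBob B.\t250\n"
)

sample = pd.read_csv(io.StringIO(raw), sep="\t")
print(sample)

# the short line produced one row whose remaining columns are NaN
print(sample["name"].isna().sum())
```

Because of the NaN, the numeric column is read as floating point rather than integer.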


In [3]:
# show the entire data frame - note the summary (number of rows and columns) at the end
df

Unnamed: 0,screen_name,name,followers_count,location,status
0,TuckerCarlson,Tucker Carlson,1064787,"Washington, DC",Thanks for joining! https://t.co/YNNjGghUxR
1,jessebwatters,Jesse Watters,410410,,RT @FNC_Ladies_Rule: 🍾🎉Fox News is No. 1 Basic...
2,https://t.co/cbHoqt3QTG https://t.co/ODX77…,,,,
3,WhiteHouse,The White House,15327283,"Washington, D.C.","Lt. Gen. (Ret) Keith Kellogg: ""Trump as Comman..."
4,Scavino45,Dan Scavino Jr.,148667,@Twitter @DanScavino,'It is time to heal the wounds that have divid...
5,KellyannePolls,Kellyanne Conway,1664393,"Washington, DC",Important read: Gen. Kellogg combines personal...
6,Reince,Reince Priebus,941927,"Kenosha, WI and Washington, DC","Thanks to @POTUS, Foxconn, @ScottWalker, @PRya..."
7,✔️$10B investment,,,,
8,✔️13K jobs,,,,
9,✔️ #MadeInAmerica,,,,


In [4]:
# each column can be accessed by providing its name
df['name']

0            Tucker Carlson
1             Jesse Watters
2                       NaN
3           The White House
4           Dan Scavino Jr.
5          Kellyanne Conway
6            Reince Priebus
7                       NaN
8                       NaN
9                       NaN
10              Roma Downey
11       Trump Organization
12               Trump Golf
13     Tiffany Ariana Trump
14           Laura Ingraham
15               Mike Pence
16      Official Team Trump
17            DRUDGE REPORT
18            Vanessa Trump
19               Lara Trump
20             Sean Hannity
21               Fox Nation
22     Corey R. Lewandowski
23              Ann Coulter
24        Diamond and Silk®
25          KATRINA CAMPINS
26          Katrina Pierson
27            Michael Cohen
28            FOX & friends
29            MELANIA TRUMP
               ...         
74               Lara Trump
75             Sean Hannity
76     Corey R. Lewandowski
77        Diamond and Silk®
78          KATRINA 

In [5]:
# each element can be accessed by providing the name of the column and the row number
df['name'][96]

'Trump Los Angeles'

In [6]:
# Every column has a type (in this case all columns contain objects)
df.dtypes

screen_name        object
name               object
followers_count    object
location           object
status             object
dtype: object
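
The followers_count column is reported as object even though it looks numeric, which usually means that at least one value in it is not a number. As a sketch on a hypothetical series (the values are made up for illustration), `pd.to_numeric` with `errors="coerce"` converts what it can and turns everything else into NaN:

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a stray header value
followers = pd.Series(["1064787", "410410", "followers_count", None], dtype=object)

# errors="coerce" replaces anything that cannot be parsed with NaN
numeric = pd.to_numeric(followers, errors="coerce")
print(numeric.dtype)
print(numeric)
```

Whether coercing is appropriate depends on the data; inspecting the rows that become NaN first is usually a good idea.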

#### Do exercises 1, 2 and 3 in the CleaningDataExercises notebook


In [7]:
# make a copy of the data frame
df1 = df.copy()
# dropna drops rows (or columns if axis=1) that contain missing values (NaN)
# here only rows with NaN in the location column are dropped (subset=['location'])
df2 = df1.dropna(subset=['location'], axis=0, how='any')

In [8]:
df2

Unnamed: 0,screen_name,name,followers_count,location,status
0,TuckerCarlson,Tucker Carlson,1064787,"Washington, DC",Thanks for joining! https://t.co/YNNjGghUxR
3,WhiteHouse,The White House,15327283,"Washington, D.C.","Lt. Gen. (Ret) Keith Kellogg: ""Trump as Comman..."
4,Scavino45,Dan Scavino Jr.,148667,@Twitter @DanScavino,'It is time to heal the wounds that have divid...
5,KellyannePolls,Kellyanne Conway,1664393,"Washington, DC",Important read: Gen. Kellogg combines personal...
6,Reince,Reince Priebus,941927,"Kenosha, WI and Washington, DC","Thanks to @POTUS, Foxconn, @ScottWalker, @PRya..."
10,RealRomaDowney,Roma Downey,192126,Malibu,Kindness matters 🦋 #bekind https://t.co/telXo6...
11,Trump,Trump Organization,254702,"New York, NY",Brighten up your mornings in the Windy City wi...
14,IngrahamAngle,Laura Ingraham,1664852,DC,"Is it ""normalcy"" to govern by by ""continuing r..."
16,TeamTrump,Official Team Trump,742514,USA,RT @WhiteHouse: Watch LIVE as President Trump ...
17,DRUDGE_REPORT,DRUDGE REPORT,1319753,US,Now ISIS threatens Rome... https://t.co/ZOjKSk...
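
The effect of the `subset` and `how` parameters of `dropna` can be seen on a small hypothetical data frame (names and values are made up). Note that the remaining rows keep their original index labels, as in the output above:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "name": ["Ann", "Bob", np.nan, "Dee"],
    "location": ["DC", np.nan, np.nan, "NYC"],
})

# only look at the location column when deciding which rows to drop
by_location = people.dropna(subset=["location"])

# drop a row if any of its values is missing
any_missing = people.dropna(how="any")

# drop a row only if all of its values are missing
all_missing = people.dropna(how="all")

print(by_location.index.tolist())
print(any_missing.index.tolist())
print(all_missing.index.tolist())
```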


In [9]:
# remove the status column
df2.drop(['status'], inplace=True, axis=1)
df2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


Unnamed: 0,screen_name,name,followers_count,location
0,TuckerCarlson,Tucker Carlson,1064787,"Washington, DC"
3,WhiteHouse,The White House,15327283,"Washington, D.C."
4,Scavino45,Dan Scavino Jr.,148667,@Twitter @DanScavino
5,KellyannePolls,Kellyanne Conway,1664393,"Washington, DC"
6,Reince,Reince Priebus,941927,"Kenosha, WI and Washington, DC"
10,RealRomaDowney,Roma Downey,192126,Malibu
11,Trump,Trump Organization,254702,"New York, NY"
14,IngrahamAngle,Laura Ingraham,1664852,DC
16,TeamTrump,Official Team Trump,742514,USA
17,DRUDGE_REPORT,DRUDGE REPORT,1319753,US
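
The message printed before the table is pandas' SettingWithCopyWarning: older pandas versions emit it when a data frame derived from another one is modified in place. A common way to avoid it, sketched here on hypothetical data, is to take an explicit copy and to use the non-inplace form of `drop`, which returns a new data frame:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "name": ["Ann", "Bob", "Dee"],
    "location": ["DC", np.nan, "NYC"],
    "status": ["hi", "hello", "hey"],
})

# .copy() makes the filtered result an independent data frame
located = people.dropna(subset=["location"]).copy()

# drop(columns=...) returns a new data frame and leaves `located` untouched
no_status = located.drop(columns=["status"])

print(no_status)
```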


In [10]:
# remove the row with index label 4, which has a strange location
df2.drop([4], inplace=True, axis=0)
df2

Unnamed: 0,screen_name,name,followers_count,location
0,TuckerCarlson,Tucker Carlson,1064787,"Washington, DC"
3,WhiteHouse,The White House,15327283,"Washington, D.C."
5,KellyannePolls,Kellyanne Conway,1664393,"Washington, DC"
6,Reince,Reince Priebus,941927,"Kenosha, WI and Washington, DC"
10,RealRomaDowney,Roma Downey,192126,Malibu
11,Trump,Trump Organization,254702,"New York, NY"
14,IngrahamAngle,Laura Ingraham,1664852,DC
16,TeamTrump,Official Team Trump,742514,USA
17,DRUDGE_REPORT,DRUDGE REPORT,1319753,US
19,LaraLeaTrump,Lara Trump,301998,New York City
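
`drop` with axis=0 refers to index labels, not to row positions. After `dropna` the labels have gaps, so label 4 is not necessarily the fourth or fifth row. A small sketch on a hypothetical data frame shows the difference:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Dee", "Eve"]},
    index=[0, 3, 4, 5],  # gaps in the labels, as after dropna
)

# drop by label: removes the row labelled 4 ("Dee")
by_label = people.drop(index=[4])

# drop by position: look up the label of the second row first ("Bob")
by_position = people.drop(index=people.index[1])

print(by_label["name"].tolist())
print(by_position["name"].tolist())
```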


In [11]:
# check the unique values in the column location
df2['location'].unique()

array(['Washington, DC', 'Washington, D.C.',
       'Kenosha, WI and Washington, DC', 'Malibu', 'New York, NY', 'DC',
       'USA', 'US', 'New York City', 'NYC', 'Los Angeles/NYC',
       'United States', 'Peace Within', 'New York',
       '                 NYC', 'Los Angeles', 'South Africa',
       'Greenwich, Conn.', 'Waikiki, Oahu', 'Miami, Florida',
       'Charlotte, NC', 'Las Vegas ', 'Chicago, IL', 'Sterling, VA',
       'Los Angeles, CA', 'Manhasset, NY', 'Washington',
       'London, Newick, LA. ', 'location', '1600 Pennsylvania Avenue ',
       'USA🇺🇸', 'California, USA', 'London, Newick, LA.'], dtype=object)

Washington appears in several different formats. The cell below replaces all of these variants with a single value.
It uses the pandas function replace (see http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html ) with a regular expression stating that any text starting with Wash and followed by one or more characters should be replaced by 'Washington, DC'. A good tutorial on regular expressions is at https://regexone.com/

In [12]:
df3 = df2.replace(to_replace=r'Wash.+', value='Washington, DC', regex=True)
df3['location'].unique()

array(['Washington, DC', 'Kenosha, WI and Washington, DC', 'Malibu',
       'New York, NY', 'DC', 'USA', 'US', 'New York City', 'NYC',
       'Los Angeles/NYC', 'United States', 'Peace Within', 'New York',
       '                 NYC', 'Los Angeles', 'South Africa',
       'Greenwich, Conn.', 'Waikiki, Oahu', 'Miami, Florida',
       'Charlotte, NC', 'Las Vegas ', 'Chicago, IL', 'Sterling, VA',
       'Los Angeles, CA', 'Manhasset, NY', 'London, Newick, LA. ',
       'location', '1600 Pennsylvania Avenue ', 'USA🇺🇸',
       'California, USA', 'London, Newick, LA.'], dtype=object)
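
The behaviour of the `Wash.+` pattern can be checked on a small series; the values below are taken from the location column, but the series itself is only an illustration. Only the matching part of each string is replaced, which is why 'Kenosha, WI and Washington, DC' keeps its prefix:

```python
import pandas as pd

locations = pd.Series([
    "Washington, DC",
    "Washington, D.C.",
    "Washington",
    "Kenosha, WI and Washington, DC",
    "Malibu",
])

# everything from "Wash" to the end of the string is replaced
cleaned = locations.replace(to_replace=r"Wash.+", value="Washington, DC", regex=True)
print(cleaned.unique())
```

Calling replace on the whole data frame applies the pattern to every text column, so a name starting with "Wash" would also be rewritten; calling it on the location column alone is a way to limit the effect.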

In [13]:
# note that df2 has not changed
df2['location'].unique()

array(['Washington, DC', 'Washington, D.C.',
       'Kenosha, WI and Washington, DC', 'Malibu', 'New York, NY', 'DC',
       'USA', 'US', 'New York City', 'NYC', 'Los Angeles/NYC',
       'United States', 'Peace Within', 'New York',
       '                 NYC', 'Los Angeles', 'South Africa',
       'Greenwich, Conn.', 'Waikiki, Oahu', 'Miami, Florida',
       'Charlotte, NC', 'Las Vegas ', 'Chicago, IL', 'Sterling, VA',
       'Los Angeles, CA', 'Manhasset, NY', 'Washington',
       'London, Newick, LA. ', 'location', '1600 Pennsylvania Avenue ',
       'USA🇺🇸', 'California, USA', 'London, Newick, LA.'], dtype=object)

In [14]:
# replace DC at the beginning of a string with Washington, DC
# replace NYC, together with any white space in front of it, with New York, NY
# replace a string that is exactly US with USA
# replace USA🇺🇸 with USA
# replace New York City with New York, NY
# replace a string that is exactly New York with New York, NY
df4 = df3.replace(regex={r'^DC': 'Washington, DC', r'\s*NYC': 'New York, NY', r'^US$': 'USA', 'USA🇺🇸': 'USA', 'New York City': 'New York, NY', r'^New York$': 'New York, NY'})
df4['location'].unique()

array(['Washington, DC', 'Kenosha, WI and Washington, DC', 'Malibu',
       'New York, NY', 'USA', 'Los Angeles/New York, NY', 'United States',
       'Peace Within', 'Los Angeles', 'South Africa', 'Greenwich, Conn.',
       'Waikiki, Oahu', 'Miami, Florida', 'Charlotte, NC', 'Las Vegas ',
       'Chicago, IL', 'Sterling, VA', 'Los Angeles, CA', 'Manhasset, NY',
       'London, Newick, LA. ', 'location', '1600 Pennsylvania Avenue ',
       'California, USA', 'London, Newick, LA.'], dtype=object)
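
The anchors `^` (start of string) and `$` (end of string) decide whether a pattern may match inside a longer string. The sketch below uses Python's `re` module and applies three of the patterns one after the other; this sequential loop is an assumption made for illustration and is not necessarily how pandas processes the dictionary internally:

```python
import re

samples = ["DC", "Washington, DC", "US", "USA", "   NYC", "Los Angeles/NYC"]
patterns = [
    (r"^DC", "Washington, DC"),   # DC only at the start of the string
    (r"\s*NYC", "New York, NY"),  # NYC plus any white space in front of it
    (r"^US$", "USA"),             # the whole string must be US
]

def clean(text):
    """Apply each (pattern, replacement) pair in turn."""
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text

results = {s: clean(s) for s in samples}
for original, cleaned in results.items():
    print(repr(original), "->", repr(cleaned))
```

Since `\s*NYC` is not anchored, it also matches inside 'Los Angeles/NYC', which explains the value 'Los Angeles/New York, NY' in the output above.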

In [15]:
# Explore which entities have a certain location value (e.g. which ones have USA as location)
# (df4.location == 'USA' evaluates to True or False for each row of df4; only the rows where it is True are returned)
df4[df4.location == 'USA']

Unnamed: 0,screen_name,name,followers_count,location
16,TeamTrump,Official Team Trump,742514,USA
17,DRUDGE_REPORT,DRUDGE REPORT,1319753,USA
61,KellyannePolls,Kellyanne Conway,2584390,USA
69,TeamTrump,Official Team Trump,956501,USA
72,DRUDGE_REPORT,DRUDGE REPORT,1403331,USA
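
The filter works in two steps: the comparison builds a series of True/False values, and indexing with that series keeps the rows marked True. A minimal sketch on a hypothetical data frame:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ann", "Bob", "Dee"],
    "location": ["USA", "Malibu", "USA"],
})

# step 1: one True/False value per row
mask = people.location == "USA"
print(mask.tolist())

# step 2: keep only the rows where the mask is True
in_usa = people[mask]
print(in_usa)
```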


In [16]:
# One of the location values is the string 'location'; I want to see which row it belongs to
df4[df4.location == 'location']

Unnamed: 0,screen_name,name,followers_count,location
50,screen_name,name,followers_count,location
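
Row 50 repeats the column names, which suggests that a header line ended up in the middle of the file. When there may be several such rows, they can be filtered out with a negated condition instead of being dropped one by one. A sketch on hypothetical data:

```python
import pandas as pd

people = pd.DataFrame({
    "screen_name": ["ann", "screen_name", "bob"],
    "location": ["DC", "location", "NYC"],
})

# rows whose location cell contains the column name itself
is_header = people["location"] == "location"

# ~ negates the condition, so only the real data rows are kept
without_headers = people[~is_header]
print(without_headers)
```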


In [17]:
# I drop row 50, which only repeats the column names
df4.drop([50], inplace=True, axis=0)
df4

Unnamed: 0,screen_name,name,followers_count,location
0,TuckerCarlson,Tucker Carlson,1064787,"Washington, DC"
3,WhiteHouse,The White House,15327283,"Washington, DC"
5,KellyannePolls,Kellyanne Conway,1664393,"Washington, DC"
6,Reince,Reince Priebus,941927,"Kenosha, WI and Washington, DC"
10,RealRomaDowney,Roma Downey,192126,Malibu
11,Trump,Trump Organization,254702,"New York, NY"
14,IngrahamAngle,Laura Ingraham,1664852,"Washington, DC"
16,TeamTrump,Official Team Trump,742514,USA
17,DRUDGE_REPORT,DRUDGE REPORT,1319753,USA
19,LaraLeaTrump,Lara Trump,301998,"New York, NY"


In [18]:
# Now I want to eliminate the double locations
# From the table above I see that Reince Priebus has two locations, but he also appears twice,
# so I assign Kenosha, WI to one occurrence and Washington, DC to the other
#
# in the row with index label 6, set the column 'location' to 'Kenosha, WI'
# (.loc is used instead of df4['location'][6] = ..., which recent pandas versions may not apply to df4)
df4.loc[6, 'location'] = 'Kenosha, WI'
# in the row with index label 62, set the column 'location' to 'Washington, DC'
df4.loc[62, 'location'] = 'Washington, DC'
df4

Unnamed: 0,screen_name,name,followers_count,location
0,TuckerCarlson,Tucker Carlson,1064787,"Washington, DC"
3,WhiteHouse,The White House,15327283,"Washington, DC"
5,KellyannePolls,Kellyanne Conway,1664393,"Washington, DC"
6,Reince,Reince Priebus,941927,"Kenosha, WI"
10,RealRomaDowney,Roma Downey,192126,Malibu
11,Trump,Trump Organization,254702,"New York, NY"
14,IngrahamAngle,Laura Ingraham,1664852,"Washington, DC"
16,TeamTrump,Official Team Trump,742514,USA
17,DRUDGE_REPORT,DRUDGE REPORT,1319753,USA
19,LaraLeaTrump,Lara Trump,301998,"New York, NY"
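
A minimal sketch of assigning single cells with `.loc`, which takes the row label and the column name in one indexing step. The two-row data frame is hypothetical and only mirrors the index labels used above:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "name": ["Reince Priebus", "Reince Priebus"],
        "location": ["Kenosha, WI and Washington, DC"] * 2,
    },
    index=[6, 62],
)

# .loc[row label, column name] addresses exactly one cell
people.loc[6, "location"] = "Kenosha, WI"
people.loc[62, "location"] = "Washington, DC"
print(people)
```

Because row and column are given together, the assignment acts on the data frame itself rather than on an intermediate object.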


In [19]:
# Ann Coulter also has a double location so I check whether she also appears twice (she does not)
df4[df4.name == 'Ann Coulter']

Unnamed: 0,screen_name,name,followers_count,location
23,AnnCoulter,Ann Coulter,1662294,"Los Angeles/New York, NY"


In [20]:
# I just keep Los Angeles
df4.loc[23, 'location'] = 'Los Angeles, CA'
df4['location'].unique()

array(['Washington, DC', 'Kenosha, WI', 'Malibu', 'New York, NY', 'USA',
       'Los Angeles, CA', 'United States', 'Peace Within', 'Los Angeles',
       'South Africa', 'Greenwich, Conn.', 'Waikiki, Oahu',
       'Miami, Florida', 'Charlotte, NC', 'Las Vegas ', 'Chicago, IL',
       'Sterling, VA', 'Manhasset, NY', 'London, Newick, LA. ',
       '1600 Pennsylvania Avenue ', 'California, USA',
       'London, Newick, LA.'], dtype=object)

In [21]:
# look for all entities with location equal to 'London, Newick, LA. ' or 'London, Newick, LA.'
df4[(df4.location == 'London, Newick, LA. ') | (df4.location == 'London, Newick, LA.')]

Unnamed: 0,screen_name,name,followers_count,location
46,piersmorgan,Piers Morgan,6035181,"London, Newick, LA."
101,piersmorgan,Piers Morgan,6535363,"London, Newick, LA."
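
Conditions joined with `|` (or) must each be wrapped in parentheses. When a column is compared against several values, `isin` expresses the same filter more compactly. A sketch on hypothetical data:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Piers", "Piers", "Ann"],
    "location": ["London, Newick, LA. ", "London, Newick, LA.", "Malibu"],
})

variants = ["London, Newick, LA. ", "London, Newick, LA."]

# explicit "or" of two comparisons
with_or = people[(people.location == variants[0]) | (people.location == variants[1])]

# the same rows selected with isin
with_isin = people[people.location.isin(variants)]

print(with_isin)
```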


In [22]:
# make one in London and one in LA
df4.loc[46, 'location'] = 'London, UK'
df4.loc[101, 'location'] = 'Los Angeles, CA'
df4['location'].unique()

array(['Washington, DC', 'Kenosha, WI', 'Malibu', 'New York, NY', 'USA',
       'Los Angeles, CA', 'United States', 'Peace Within', 'Los Angeles',
       'South Africa', 'Greenwich, Conn.', 'Waikiki, Oahu',
       'Miami, Florida', 'Charlotte, NC', 'Las Vegas ', 'Chicago, IL',
       'Sterling, VA', 'Manhasset, NY', 'London, UK',
       '1600 Pennsylvania Avenue ', 'California, USA'], dtype=object)

In [23]:
# Last few changes
df5=df4.replace(regex={r'United States': 'USA', '^Los Angeles$': 'Los Angeles, CA', '1600 Pennsylvania Avenue ': '1600 Pennsylvania Avenue, Washington, DC'})
df5['location'].unique()

array(['Washington, DC', 'Kenosha, WI', 'Malibu', 'New York, NY', 'USA',
       'Los Angeles, CA', 'Peace Within', 'South Africa',
       'Greenwich, Conn.', 'Waikiki, Oahu', 'Miami, Florida',
       'Charlotte, NC', 'Las Vegas ', 'Chicago, IL', 'Sterling, VA',
       'Manhasset, NY', 'London, UK',
       '1600 Pennsylvania Avenue, Washington, DC', 'California, USA'],
      dtype=object)
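
Some values, such as 'Las Vegas ', still carry trailing white space. One possible extra step, not part of the cells above, is `str.strip()`, which removes white space at both ends of every string in a column. A sketch on a hypothetical series:

```python
import pandas as pd

locations = pd.Series(["Las Vegas ", "   NYC", "Chicago, IL"])

# remove leading and trailing white space from every value
trimmed = locations.str.strip()
print(trimmed.tolist())
```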

In [24]:
# save the cleaned data frame (to_csv writes comma-separated values by default)
df5.to_csv('cleanDataFromTweeter.txt')
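
Writing and reading back should use the same separator. The sketch below round-trips a hypothetical data frame through an in-memory buffer; the choice of `sep="\t"` is an assumption made to match the tab-separated input file, and `index_col=0` tells `read_csv` that the first column holds the index labels:

```python
import io
import pandas as pd

people = pd.DataFrame(
    {"name": ["Ann", "Dee"], "location": ["Washington, DC", "Malibu"]},
    index=[0, 3],
)

buffer = io.StringIO()
people.to_csv(buffer, sep="\t")  # the index is written as the first column
buffer.seek(0)

# read the text back, using the first column as the index again
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```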

#### Do the remaining exercises in the CleaningDataExercises notebook