# Introduction

This week we return to pandas to look at data cleaning practices in Python. Our goal is to be able to identify and handle common data issues such as misformatted strings, missing or duplicated values, and entries that fail consistency checks.

The readings for this week are: 
* [Sections 3.4-3.7 and 3.10 of the Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)
* The .pdf in this folder also has several reminders of the syntax



Like all of the other data types and structures we have studied so far, strings have their own sets of functions and methods that can be applied without loading in any additional packages. Beyond basic transformation of the characters, there are also functions for splitting, searching, and slicing strings. 

In [1]:
my_string = 'Hello, world!'

In [2]:
my_string.upper()

'HELLO, WORLD!'

In [3]:
my_string.lower()

'hello, world!'

In [5]:
my_string.title()

'Hello, World!'

In [37]:
"hello, world".startswith("H")

False

In [6]:
split_list = " hello, world".split(',')

In [7]:
split_list

[' hello', ' world']

In [8]:
split_list[0].strip()

'hello'

In [9]:
split_list[1].strip()

'world'

In [49]:
"-".join(split_list)

' hello- world'

In [10]:
my_string.count(", wor")

1

In [11]:
"hello" in my_string

False

In [12]:
"Hello" in my_string

True

In [13]:
my_string.find("world")

7

In [16]:
my_string.find("x")

-1

In [15]:
my_string.index("x")

ValueError: substring not found

In [17]:
number_string = '123,456'

In [20]:
split_number_string = number_string.split(',')

In [22]:
number = int(split_number_string[0]+split_number_string[1])
print(number)

123456


In [25]:
number_string = '1,543,293,593,192,403'

if ',' in number_string: 
    number = ''
    for part in number_string.split(','):
        number += part
    number = int(number)
    print(number)
else:
    print("There are no commas in that number")

1543293593192403
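
The same clean-up can also be written without a loop. The following is a minimal sketch, assuming the only non-digit characters in the string are comma separators; the helper name `parse_grouped_int` is hypothetical and only used for illustration.

```python
def parse_grouped_int(text):
    """Convert a string such as '1,543,293' to an int by deleting the commas."""
    # str.replace returns a new string with every comma removed;
    # a string without commas passes through unchanged.
    return int(text.replace(",", ""))


number = parse_grouped_int("1,543,293,593,192,403")
print(number)
```

Unlike the `if`/`else` version, this sketch does not report whether any commas were present; it simply converts whatever digits remain.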


Pandas provides vectorized versions of most of the string functions, which can be applied to entire columns at once. These are accessed with the `.str` syntax and mostly behave just like their regular counterparts.

In [26]:
import pandas as pd

In [29]:
df = pd.read_csv("./Data/string_processing.csv")

In [31]:
df.head()

Unnamed: 0,Name,ZipCode,Snack,Soda,Unnamed: 4
0,SOPHIA GOOD,241481,Eggs,Pepsi,
1,SIANNA SPENCER,918992,Snickers Bars,Coke,
2,LULU BROWNING,477153,Candy Corn,Coca-Cola,
3,MACAULY FROST,529572,Cheetos,Coke,
4,ANNABELLA ALVAREZ,704066,Marmalade,Pepsi-Cola,


In [32]:
df = df[["Name","ZipCode","Snack","Soda"]]

In [33]:
df.head()

Unnamed: 0,Name,ZipCode,Snack,Soda
0,SOPHIA GOOD,241481,Eggs,Pepsi
1,SIANNA SPENCER,918992,Snickers Bars,Coke
2,LULU BROWNING,477153,Candy Corn,Coca-Cola
3,MACAULY FROST,529572,Cheetos,Coke
4,ANNABELLA ALVAREZ,704066,Marmalade,Pepsi-Cola


In [34]:
df["Name"].str.title()

0           Sophia Good
1        Sianna Spencer
2         Lulu Browning
3         Macauly Frost
4     Annabella Alvarez
5        Teodor Charles
6           Aron Battle
7        Sienna Collier
8      Esme-Rose Bright
9          Stacey Morin
10           Carlie Key
11       Alberto Flores
12      Teddie Clarkson
13         Komal Burris
14        Sanna Holcomb
15         Percy Arnold
16         Rhydian Pate
17        Aleena Werner
18            Shay Neal
19       Shayaan Barnes
20         Daria Gibson
21       Jaimee Schmidt
22           Ria Norman
23        Kayley Alford
24           Milosz Day
25      Nadeem Townsend
26     Zunairah Hibbert
27    Tayyab Millington
28          Kelsey Luna
29         Zander Scott
Name: Name, dtype: object

In [35]:
df["New_Name"]=df["Name"].str.title()

In [36]:
df.head()


Unnamed: 0,Name,ZipCode,Snack,Soda,New_Name
0,SOPHIA GOOD,241481,Eggs,Pepsi,Sophia Good
1,SIANNA SPENCER,918992,Snickers Bars,Coke,Sianna Spencer
2,LULU BROWNING,477153,Candy Corn,Coca-Cola,Lulu Browning
3,MACAULY FROST,529572,Cheetos,Coke,Macauly Frost
4,ANNABELLA ALVAREZ,704066,Marmalade,Pepsi-Cola,Annabella Alvarez


In [67]:
(df['Snack'].str.find('Gold')).map({-1:False,0:True})


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20     True
21     True
22     True
23     True
24     True
25    False
26    False
27    False
28    False
29    False
Name: Snack, dtype: bool

In [68]:
df['Snack'].str.find('Gold').map({-1:False,0:True})


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20     True
21     True
22     True
23     True
24     True
25    False
26    False
27    False
28    False
29    False
Name: Snack, dtype: bool

In [69]:
df.loc[df['Snack'].str.find('Gold').map({-1:False,0:True})]


Unnamed: 0,Name,ZipCode,Snack,Soda,New_Name
20,daria gibson,116700,GoldFish,coke,Daria Gibson
21,jaimee schmidt,971168,Goldfish,pepsi-cola,Jaimee Schmidt
22,Ria Norman,503193,Gold fish,Pepsi,Ria Norman
23,Kayley Alford,52248,Gold Fish,Coke,Kayley Alford
24,Milosz Day,64933,Gold Fish,Coca-Cola,Milosz Day
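
Mapping the result of `.str.find` works here because every match starts at position 0. A match later in the string returns a different position, which the dictionary does not cover, so it would map to NaN. One possible alternative is `.str.contains`, which returns the Boolean mask directly. The sketch below uses a small made-up Series rather than the course data.

```python
import pandas as pd

# Hypothetical snack entries, not taken from string_processing.csv.
snacks = pd.Series(
    ["Goldfish", "Gold Fish", "Cheetos", "Chocolate Gold Coins", "goldfish"]
)

# regex=False treats 'Gold' as a literal substring rather than a pattern.
has_gold = snacks.str.contains("Gold", regex=False)

# case=False also matches lower-case spellings such as 'goldfish'.
has_gold_any_case = snacks.str.contains("gold", case=False, regex=False)

print(snacks[has_gold])
print(snacks[has_gold_any_case])
```

Note that "Chocolate Gold Coins" is selected by `.str.contains`, whereas the `find`-and-`map` approach would have produced NaN for it.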


In [65]:
df["Soda"].str.upper().str.startswith("P").replace([True,False],["Pepsi","Coke"])

0     Pepsi
1      Coke
2      Coke
3      Coke
4     Pepsi
5     Pepsi
6      Coke
7      Coke
8      Coke
9      Coke
10    Pepsi
11     Coke
12     Coke
13    Pepsi
14     Coke
15    Pepsi
16     Coke
17     Coke
18    Pepsi
19     Coke
20     Coke
21    Pepsi
22    Pepsi
23     Coke
24     Coke
25     Coke
26     Coke
27     Coke
28    Pepsi
29    Pepsi
Name: Soda, dtype: object
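
As a hedged alternative to chaining `.replace` onto a Boolean column, the same labelling can be sketched with `np.where`. The data below is made up, and the rule carries the same assumption as the cell above: anything that does not start with "P" is treated as Coke.

```python
import numpy as np
import pandas as pd

# Hypothetical soda entries with inconsistent spellings.
sodas = pd.Series(["Pepsi", "Coke", "Coca-Cola", "pepsi-cola", "coke"])

# Boolean mask: True where the entry starts with 'p' in either case.
is_pepsi = sodas.str.upper().str.startswith("P")

# np.where picks 'Pepsi' where the mask is True and 'Coke' everywhere else.
brand = pd.Series(np.where(is_pepsi, "Pepsi", "Coke"), index=sodas.index)
print(brand)
```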

Beyond string operations, there are several other common types of data cleaning issues that pandas allows us to handle in a vectorized fashion. Missing values are represented with NaN (Not a Number) inside the dataframe, and frequently one of our first cleaning decisions is whether to remove (drop) the missing data or fill it in with another value. Another common data issue is duplicate values, which may also need to be removed. We also might want to check that our data entries satisfy reasonable consistency checks on both the values and the data types.

In [70]:
df = pd.read_csv("./Data/VA_VOTES.csv")

In [71]:
df.head()

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
0,5200653,89.0,4.0,4.0,0.0,100.00%,0.00%
1,5200368,159.0,67.0,45.0,22.0,67.16%,32.84%
2,5200971,4.0,0.0,0.0,0.0,,
3,5200269,112.0,77.0,15.0,62.0,19.48%,80.52%
4,5200516,72.0,62.0,3.0,59.0,4.84%,95.16%


In [72]:
df.loc[2,"R_Percent"]

nan

In [74]:
import numpy as np
np.nan 


nan

In [75]:
df['D_Percent'].isnull()

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12     True
13    False
14    False
15    False
16    False
17     True
18    False
19     True
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28     True
29    False
      ...  
70    False
71    False
72    False
73     True
74    False
75    False
76     True
77    False
78    False
79    False
80    False
81    False
82    False
83    False
84    False
85    False
86     True
87    False
88    False
89    False
90    False
91    False
92     True
93    False
94    False
95    False
96    False
97    False
98    False
99    False
Name: D_Percent, Length: 100, dtype: bool

In [76]:
df[df['D_Percent'].isnull()].index

Int64Index([2, 10, 12, 17, 19, 28, 36, 40, 41, 47, 56, 73, 76, 86, 92], dtype='int64')

In [77]:
df_dropped = df.drop(df[df['D_Percent'].isnull()].index)

In [78]:
df_dropped

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
0,5200653,89.0,4.0,4.0,0.0,100.00%,0.00%
1,5200368,159.0,67.0,45.0,22.0,67.16%,32.84%
3,5200269,112.0,77.0,15.0,62.0,19.48%,80.52%
4,5200516,72.0,62.0,3.0,59.0,4.84%,95.16%
5,5200600,272.0,1.0,1.0,0.0,100.00%,0.00%
6,5200271,35.0,37.0,6.0,30.0,17.14%,82.86%
7,5200753,490.0,352.0,319.0,33.0,90.63%,9.38%
8,5200773,57.0,79.0,55.0,24.0,69.62%,30.38%
9,5200745,7.0,36.0,4.0,3.0,57.14%,42.86%
11,5200408,110.0,75.0,37.0,38.0,49.33%,50.67%
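
Dropping rows by the index of their null entries can also be expressed with `dropna`. This is a small sketch on made-up data, assuming we only care about missing values in one column; the column names mirror the ones above, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy data with invented values.
votes = pd.DataFrame({
    "BLOCKID": [1, 2, 3, 4],
    "Voters": [4.0, 0.0, 77.0, np.nan],
    "D_Percent": ["100.00%", np.nan, "19.48%", "50.00%"],
})

# Count the missing entries per column before deciding what to do with them.
print(votes.isnull().sum())

# Keep only rows whose D_Percent is present; other columns may still hold NaN.
votes_dropped = votes.dropna(subset=["D_Percent"])
print(votes_dropped)
```

As with `drop`, the original index labels are kept, so the result has a gap where the dropped row used to be.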


In [81]:
df.fillna(-100)

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
0,5200653,89.0,4.0,4.0,0.0,100.00%,0.00%
1,5200368,159.0,67.0,45.0,22.0,67.16%,32.84%
2,5200971,4.0,0.0,0.0,0.0,-100,-100
3,5200269,112.0,77.0,15.0,62.0,19.48%,80.52%
4,5200516,72.0,62.0,3.0,59.0,4.84%,95.16%
5,5200600,272.0,1.0,1.0,0.0,100.00%,0.00%
6,5200271,35.0,37.0,6.0,30.0,17.14%,82.86%
7,5200753,490.0,352.0,319.0,33.0,90.63%,9.38%
8,5200773,57.0,79.0,55.0,24.0,69.62%,30.38%
9,5200745,7.0,36.0,4.0,3.0,57.14%,42.86%
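
Filling every column with a single sentinel also places that number inside the text columns, as the `-100` entries in `D_Percent` and `R_Percent` show. One option is to pass a dictionary so that each column gets its own fill value. The sketch below uses invented data, and treating a missing voter count as 0 is purely an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "Voters": [4.0, np.nan, 77.0],
    "D_Percent": ["100.00%", np.nan, "19.48%"],
})

# A dict maps each column to its own fill value; unlisted columns are left alone.
filled = toy.fillna({"Voters": 0.0})
print(filled)
```

Like `df.fillna(-100)` above, this returns a new dataframe and leaves `toy` unchanged.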


In [82]:
df_dropped[df_dropped.duplicated()]

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
72,5200365,84.0,65.0,40.0,25.0,61.54%,38.46%


In [83]:
df_dropped[df_dropped["BLOCKID"]==5200365]

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
58,5200365,84.0,65.0,40.0,25.0,61.54%,38.46%
72,5200365,84.0,65.0,40.0,25.0,61.54%,38.46%


In [84]:
df_dropped = df_dropped.drop_duplicates()

In [85]:
df_dropped

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
0,5200653,89.0,4.0,4.0,0.0,100.00%,0.00%
1,5200368,159.0,67.0,45.0,22.0,67.16%,32.84%
3,5200269,112.0,77.0,15.0,62.0,19.48%,80.52%
4,5200516,72.0,62.0,3.0,59.0,4.84%,95.16%
5,5200600,272.0,1.0,1.0,0.0,100.00%,0.00%
6,5200271,35.0,37.0,6.0,30.0,17.14%,82.86%
7,5200753,490.0,352.0,319.0,33.0,90.63%,9.38%
8,5200773,57.0,79.0,55.0,24.0,69.62%,30.38%
9,5200745,7.0,36.0,4.0,3.0,57.14%,42.86%
11,5200408,110.0,75.0,37.0,38.0,49.33%,50.67%


In [86]:
df_dropped[df_dropped["BLOCKID"].duplicated(keep=False)]

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent
8,5200773,57.0,79.0,55.0,24.0,69.62%,30.38%
15,5200773,87.0,79.0,55.0,24.0,69.62%,30.38%
16,5200382,63.0,24.0,5.0,29.0,20.83%,79.17%
24,5200754,185.0,49.0,33.0,16.0,67.35%,32.65%
51,5200779,27.0,6.0,4.0,2.0,66.67%,33.33%
55,5200382,30.0,15.0,4.0,11.0,26.67%,73.33%
85,5200754,,32.0,14.0,18.0,43.75%,56.25%
96,5200779,364.0,131.0,63.0,68.0,48.09%,51.91%
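
The two checks above answer different questions: `duplicated()` on the whole dataframe flags rows that repeat in every column, while `duplicated(keep=False)` on a single column flags every row that shares a key. A minimal sketch on invented data follows. Keeping the first row per key is only one possible policy, and which row is the correct one usually has to be decided by inspecting the data.

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "BLOCKID": [10, 11, 10, 12, 11],
    "Voters": [5, 7, 5, 9, 8],
})

# Whole-row duplicates: only the second (10, 5) row is flagged.
full_dupes = toy.duplicated()

# Same BLOCKID regardless of the other columns; keep=False flags every copy.
id_dupes = toy["BLOCKID"].duplicated(keep=False)

# One possible policy: keep the first row seen for each BLOCKID.
first_per_block = toy.drop_duplicates(subset="BLOCKID", keep="first")

print(toy[id_dupes])
print(first_per_block)
```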


In [87]:
df_dropped = df_dropped.drop(8)

In [88]:
df_dropped = df_dropped.drop(85)

In [89]:
len(df_dropped["BLOCKID"].unique())

80

In [90]:
df_dropped['pop check'] = df_dropped["Population"] > df_dropped["Voters"]

In [91]:
df_dropped.loc[df_dropped['pop check']==False]

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent,pop check
6,5200271,35.0,37.0,6.0,30.0,17.14%,82.86%,False
9,5200745,7.0,36.0,4.0,3.0,57.14%,42.86%,False
60,5200934,,7.0,2.0,5.0,28.57%,71.43%,False
93,5200455,40.0,,4.0,8.0,33.33%,66.67%,False


In [92]:
df_dropped.loc[df_dropped["D_GOV"]+df_dropped["R_GOV"] != df_dropped["Voters"]]

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent,pop check
6,5200271,35.0,37.0,6.0,30.0,17.14%,82.86%,False
9,5200745,7.0,36.0,4.0,3.0,57.14%,42.86%,False
16,5200382,63.0,24.0,5.0,29.0,20.83%,79.17%,True
21,5200534,100.0,62.0,16.0,44.0,25.81%,74.19%,True
26,5200487,112.0,18.0,,16.0,11.11%,88.89%,True
46,5200829,279.0,237.0,-170.0,-67.0,71.73%,28.27%,True
50,5200790,150.0,84.0,-83.0,1.0,98.81%,1.19%,True
54,5200028,84.0,48.0,38.0,9.0,81.25%,18.75%,True
71,5200529,212.0,190.0,135.0,45.0,78.95%,21.05%,True
93,5200455,40.0,,4.0,8.0,33.33%,66.67%,False


In [93]:
df_dropped.loc[26,"D_GOV"] = 2  # Voters (18) minus R_GOV (16)

In [94]:
df_dropped.loc[26]

BLOCKID       5200487
Population        112
Voters             18
D_GOV               2
R_GOV              16
D_Percent      11.11%
R_Percent      88.89%
pop check        True
Name: 26, dtype: object

In [95]:
df_dropped["D_GOV"] = df_dropped["D_GOV"].abs()

In [96]:
df_dropped["R_GOV"] = df_dropped["R_GOV"].abs()

In [97]:
df_dropped.loc[df_dropped["D_GOV"]+df_dropped["R_GOV"] != df_dropped["Voters"]]

Unnamed: 0,BLOCKID,Population,Voters,D_GOV,R_GOV,D_Percent,R_Percent,pop check
6,5200271,35.0,37.0,6.0,30.0,17.14%,82.86%,False
9,5200745,7.0,36.0,4.0,3.0,57.14%,42.86%,False
16,5200382,63.0,24.0,5.0,29.0,20.83%,79.17%,True
21,5200534,100.0,62.0,16.0,44.0,25.81%,74.19%,True
54,5200028,84.0,48.0,38.0,9.0,81.25%,18.75%,True
71,5200529,212.0,190.0,135.0,45.0,78.95%,21.05%,True
93,5200455,40.0,,4.0,8.0,33.33%,66.67%,False
97,5200929,284.0,240.0,66.0,180.0,25.00%,75.00%,True
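
The percentage columns are stored as strings, so a data type check is also worthwhile before using them numerically. The sketch below combines the string methods from the first half with a consistency check. It assumes the percentages were meant to be the Democratic share of the two-party count rounded to two decimals, and it uses a few illustrative rows rather than the full file; the new column names are hypothetical.

```python
import pandas as pd

# A few illustrative rows; the last one is deliberately inconsistent.
toy = pd.DataFrame({
    "Voters": [67.0, 77.0, 48.0],
    "D_GOV": [45.0, 15.0, 38.0],
    "R_GOV": [22.0, 62.0, 9.0],
    "D_Percent": ["67.16%", "19.48%", "81.25%"],
})

# Strip the trailing '%' and convert the remaining text to a float.
toy["D_Share"] = toy["D_Percent"].str.rstrip("%").astype(float)

# Recompute the share from the counts and compare within a rounding tolerance.
recomputed = 100 * toy["D_GOV"] / (toy["D_GOV"] + toy["R_GOV"])
toy["share ok"] = (toy["D_Share"] - recomputed).abs() < 0.01

# Flag rows where the two-party counts do not add up to the reported total.
toy["votes ok"] = (toy["D_GOV"] + toy["R_GOV"]) == toy["Voters"]

print(toy)
```

Rows containing NaN fail both comparisons, which is why entries with a missing count show up in the mismatch tables above.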
