<br> 
<center><img src="https://i.imgur.com/hkb7Bq7.png" width="500"></center>

# [Session 1: Introduction to Python](https://github.com/eScienceWinterSchool/RRStudioSession)


_____

### Prof. José Manuel Magallanes, PhD

* Associate Professor, Departamento de Ciencias Sociales, Pontificia Universidad Católica del Perú, [jmagallanes@pucp.edu.pe](mailto:jmagallanes@pucp.edu.pe)

* Visiting Associate Professor, Evans School of Public Policy and Governance / Senior Data Science Fellow, eScience Institute, University of Washington, [magajm@uw.edu](mailto:magajm@uw.edu)
_____



## Data Structures

Python has basic native structures, like lists, tuples and dictionaries.

* **LISTS** are the most flexible structure.

In [53]:
names=["Qing", "Françoise", "Raúl", "Bjork","Marie"]
ages=[32,33,28,30,29]
country=["China", "Senegal", "Spain", "Norway","Korea"]
education=["Bach", "Bach", "Master", "PhD","PhD"]

Above we have created some lists. Lists can contain any values. Lists support different operations:

* Accessing:

In [54]:
# one element
ages[0]

32

In [55]:
# several, using slices:
ages[1:-1] #second to before last

[33, 28, 30]

In [56]:
# several, using slices:
ages[:-2] #all but two last ones

[32, 33, 28]

* Modifying:

In [57]:
# by position
country[2]="España"

# list changed:
country

['China', 'Senegal', 'España', 'Norway', 'Korea']

In [58]:
# by value
country=["PR China" if x == "China" else x for x in country]

# list changed:
country

['PR China', 'Senegal', 'España', 'Norway', 'Korea']

* Deleting

In [59]:
# by position
del country[-1]
# list changed:
country

['PR China', 'Senegal', 'España', 'Norway']

In [60]:
# by position
names.pop(-1)
# list changed:
names

['Qing', 'Françoise', 'Raúl', 'Bjork']
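A small sketch of the difference between the two deletions by position used above, with a hypothetical list (`fruits` is not part of the session data): _pop_ gives back the element it removes, while _del_ just removes it.

```python
# hypothetical list, used only for this illustration
fruits = ["apple", "banana", "cherry", "date"]

removed = fruits.pop(-1)   # pop removes the element AND returns it
print(removed)             # date
print(fruits)              # ['apple', 'banana', 'cherry']

del fruits[0]              # del removes the element and returns nothing
print(fruits)              # ['banana', 'cherry']
```

This is why _pop_ is handy when you still need the value you are taking out.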

In [61]:
# only 'del' works for several positions

lista=[1,2,3,4,5,6]
del lista[1:3]
lista

[1, 4, 5, 6]

In [62]:
# by value
ages.remove(29) 
# list changed:
ages # just the first occurrence!!

[32, 33, 28, 30]

In [63]:
# by value
education.remove('PhD') 
# list changed:
education # just the first occurrence!!

['Bach', 'Bach', 'Master', 'PhD']

In [64]:
# all values:

lista=[1,'a',45,'b','a']
lista=[x for x in lista if x!='a']
lista

[1, 45, 'b']

* Inserting value

In [68]:
# at the end
lista.append("abc")
lista

[1, 45, 'b', 'abc']

In [66]:
# insert in some other place
# first delete
education.pop(2)
education

['Bach', 'Bach', 'PhD']

In [67]:
# now insert
education.insert(2,"Master")
education

['Bach', 'Bach', 'Master', 'PhD']

* **TUPLES** are immutable structures in Python; they look like lists:

In [70]:
# new tuple:
weekend=("Friday", "Saturday", "Sunday")

You can access:

In [71]:
weekend[0]

'Friday'

But you cannot modify, add or delete elements once the tuple is created.
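A minimal sketch of what immutability means in practice. The tuple is the same _weekend_ as above; the names `asList` and `longWeekend` are just for this illustration.

```python
weekend = ("Friday", "Saturday", "Sunday")

try:
    weekend[0] = "Thursday"      # tuples do not support item assignment
except TypeError as error:
    message = str(error)
    print("Not allowed:", message)

# to get a modified version, build a list, change it, and create a NEW tuple
asList = list(weekend)
asList[0] = "Thursday"
longWeekend = tuple(asList)
print(longWeekend)
```

The original tuple is untouched; you only get a new one.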

Python itself uses tuples as output of some important functions:

In [72]:
zip(names,ages)

<zip at 0x10dfcf820>

The **zip** function creates tuples, by combining in parallel. You can see it if you turn the result into a list:

In [73]:
list(zip(names,ages))  # a list of tuples

[('Qing', 32), ('Françoise', 33), ('Raúl', 28), ('Bjork', 30)]
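One detail worth keeping in mind: _zip_ pairs elements by position and stops at the shortest input. A small sketch with hypothetical lists (`people` and `years` are assumptions for this example), using `zip_longest` from the standard library for comparison:

```python
from itertools import zip_longest

# hypothetical lists of different lengths
people = ["Ana", "Ben", "Cleo"]
years = [30, 25]

# zip stops at the shortest input: 'Cleo' is silently dropped
pairs = list(zip(people, years))
print(pairs)

# zip_longest keeps every element and fills the gaps (None by default)
allPairs = list(zip_longest(people, years))
print(allPairs)
```

So, if your lists may differ in length, check that before zipping.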

* **DICTIONARIES**  (or *dicts*) work in a more sophisticated way, as they have a **'key'**:**'value'** structure:

In [76]:
classroom={'student':names,'age':ages,'edu':education,'country':country}
# see it:

classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'España', 'Norway']}

Dictionaries do not use indexes to access values:

In [75]:
classroom[0]

KeyError: 0

They use keys:

In [77]:
classroom['student']

['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie']

Notice that each value in this dictionary is not ONE value but a LIST of values.
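A short sketch of how to reach into a dictionary of lists, using a hypothetical dictionary (`team` is not part of the session data):

```python
# hypothetical dictionary, with the same 'key': list structure
team = {"member": ["Ana", "Ben"], "age": [30, 25]}

# key first, then position inside the list
firstMember = team["member"][0]
print(firstMember)

# a missing key raises KeyError with [], but get() returns a default instead
city = team.get("city", "not available")
print(city)

# keys and values can be traversed together
for key, values in team.items():
    print(key, len(values))
```

The _items()_ traversal is the same idea used later to build a data frame from a dictionary.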

* **DATA FRAMES**

**Data frames**  are more complex containers of values. The most common analogy is a spreadsheet. To create a data frame, we need to call **pandas**:

In [78]:
import pandas

We can prepare a data frame from a dictionary:

In [79]:
classroom

{'student': ['Qing', 'Françoise', 'Raúl', 'Bjork', 'Marie'],
 'age': [32, 33, 28, 30],
 'edu': ['Bach', 'Bach', 'Master', 'PhD'],
 'country': ['PR China', 'Senegal', 'España', 'Norway']}

In [86]:
# this works only if every list has the same number of elements:
students=pandas.DataFrame(classroom)
# here the lists differ in length, so we get an error:
students

ValueError: arrays must all be same length

In [85]:
#then

students=pandas.DataFrame({key:pandas.Series(value) for key, value in classroom.items()})

# seeing it:
students

| | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | España |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | NaN | NaN | NaN |
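A minimal sketch of why turning each list into a _Series_ works, with a hypothetical dictionary (`uneven` is an assumption for this example): pandas aligns the Series by index and fills the gaps with NaN.

```python
import pandas as pd

# hypothetical dictionary whose lists have different lengths
uneven = {"name": ["Ana", "Ben", "Cleo"], "age": [30, 25]}

# each list becomes a Series; pandas aligns them by index and pads with NaN
padded = pd.DataFrame({key: pd.Series(value) for key, value in uneven.items()})
print(padded)

# the padding turns the integer column into float, since NaN is a float
print(padded.dtypes)
```

That is also why the ages in the data frame above show up as 32.0 and not 32.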


Sometimes, Python users code like this:

In [87]:
import pandas as pd # renaming the library

students=pd.DataFrame({key:pd.Series(value) for key, value in classroom.items()})
students

| | student | age | edu | country |
|---|---|---|---|---|
| 0 | Qing | 32.0 | Bach | PR China |
| 1 | Françoise | 33.0 | Bach | Senegal |
| 2 | Raúl | 28.0 | Master | España |
| 3 | Bjork | 30.0 | PhD | Norway |
| 4 | Marie | NaN | NaN | NaN |


It is important to know what you have:

In [88]:
type(students)

pandas.core.frame.DataFrame

You can get more information on the data types like this (as _str()_ in R):

In [89]:
students.dtypes

student     object
age        float64
edu         object
country     object
dtype: object

The _info()_ function can get you more details:

In [90]:
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
student    5 non-null object
age        4 non-null float64
edu        4 non-null object
country    4 non-null object
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


The data frames in pandas behave much like in R:

In [92]:
#one particular column
students.student

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [93]:
# or
students['student'] # it is not the same as: students[['student']]

0         Qing
1    Françoise
2         Raúl
3        Bjork
4        Marie
Name: student, dtype: object

In [94]:
# it is not the same as: 
students[['student']] # a data frame, not a column (or series)

| | student |
|---|---|
| 0 | Qing |
| 1 | Françoise |
| 2 | Raúl |
| 3 | Bjork |
| 4 | Marie |


In [95]:
# two columns
students.iloc[:,[1,3]]  

| | age | country |
|---|---|---|
| 0 | 32.0 | PR China |
| 1 | 33.0 | Senegal |
| 2 | 28.0 | España |
| 3 | 30.0 | Norway |
| 4 | NaN | NaN |


In [96]:
# this is also a DF
students[['country','student']]

| | country | student |
|---|---|---|
| 0 | PR China | Qing |
| 1 | Senegal | Françoise |
| 2 | España | Raúl |
| 3 | Norway | Bjork |
| 4 | NaN | Marie |


In [97]:
## Using positions is a convenient way to get several adjacent columns:
students.iloc[:,1:4]

| | age | edu | country |
|---|---|---|---|
| 0 | 32.0 | Bach | PR China |
| 1 | 33.0 | Bach | Senegal |
| 2 | 28.0 | Master | España |
| 3 | 30.0 | PhD | Norway |
| 4 | NaN | NaN | NaN |


Deleting a column:

In [99]:
# This is what you want to get rid of:
byeColumns=['edu']

# this would change the original: students.drop(byeColumns,axis=1,inplace=True)
studentsNoEd=students.drop(byeColumns,axis=1)

# this is a new DF
studentsNoEd

| | student | age | country |
|---|---|---|---|
| 0 | Qing | 32.0 | PR China |
| 1 | Françoise | 33.0 | Senegal |
| 2 | Raúl | 28.0 | España |
| 3 | Bjork | 30.0 | Norway |
| 4 | Marie | NaN | NaN |


You can modify any values in a data frame. Let me create a **deep** copy of this data frame to play with:

In [None]:
studentsCopy=students.copy()
studentsCopy

Then,

In [None]:
# I can change the age of Qing to 23 replacing 32:
studentsCopy.iloc[0,1]=23 # change is immediate! (no warning)
studentsCopy

In [None]:
# I can reset a column as **missing**:
studentsCopy.country=None
studentsCopy

In [None]:
# And, delete a column by dropping it:
studentsCopy.drop(columns=['age'],inplace=True) # 'columns' makes drop look for column names
studentsCopy

One important detail when erasing rows is to reset the index:

In [None]:
# another copy for you to see the difference:
studentsCopy2=students.copy()
studentsCopy2

In [None]:
# drop third row 
studentsCopy2.drop(index=2,axis=0) 

Here you can see the difference between **loc** and **iloc**:

In [None]:
studentsCopy2.drop(index=2,axis=0,inplace=True) 
studentsCopy2.iloc[2,:]

In [None]:
studentsCopy2.loc[2,:]

In [None]:
studentsCopy2.loc[3,:]

In [None]:
studentsCopy2.iloc[3,:]

If you need both to give the same result, you need to reset the index:

In [None]:
studentsCopy2=students.copy() # just the copy
studentsCopy2.drop(index=2,inplace=True) # deleting row
studentsCopy2.reset_index(drop=True,inplace=True) # resetting index

# you get:
studentsCopy2
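The difference between **loc** (labels) and **iloc** (positions) can be sketched with a small hypothetical data frame (`toy` is an assumption for this example):

```python
import pandas as pd

# hypothetical data frame with the default index 0..3
toy = pd.DataFrame({"name": ["Ana", "Ben", "Cleo", "Dan"]})

# dropping the row labeled 2 leaves the labels 0, 1, 3
toy = toy.drop(index=2)

byPosition = toy.iloc[2]["name"]   # third row by position
byLabel = toy.loc[3]["name"]       # row whose label is 3
print(byPosition, byLabel)         # both are the same row

# after resetting, labels and positions agree again
toy = toy.reset_index(drop=True)
print(toy.index.tolist())
```

Before the reset, asking for the label 2 with _loc_ would fail, because that label no longer exists.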

Pandas offers some practical functions:

In [None]:
# rows and columns
studentsCopy2.shape # like dim() in R

In [None]:
# length:
len(studentsCopy2) # length in R gives number of columns, here you get number of rows.

You also have _tail_ and _head_ functions in Pandas, to get some top or bottom rows:

In [None]:
students.head(2) #and students.tail(2)

You can also see the column names like this:

In [None]:
# similar to names() in R
students.columns

If you needed the column names as a list:

In [None]:
students.columns.tolist()

# or simply:
# list(students)

If you needed a column's values as a list:

In [None]:
students.age.tolist()

# or simply:
# list(students.age)

## Data Preprocessing

<a id='beginning'></a>

Preprocessing includes two stages:

1. [Cleaning](#part2) 
2. [Formatting](#part3) 

_____
<a id='part2'></a>

### Cleaning

This Wikipedia page has a table that we may need:

In [None]:
wikiLink="https://en.wikipedia.org/wiki/List_of_freedom_indices"


import IPython
iframe = '<iframe src=' + wikiLink + ' width=700 height=350></iframe>'
IPython.display.HTML(iframe)

Let's try to get the sortable table using pandas:

In [None]:
import pandas as pd

wikiTables=pd.read_html(wikiLink,header=0,attrs={'class': 'wikitable sortable'})

I tried to get all those tables. I might have more than one:

In [None]:
# What do I have? / How many?
type(wikiTables), len(wikiTables) 

I need to recover the first table from the list (the only one).

In [None]:
DF=wikiTables[0]

#what is it?
type(DF)

Great!...we have a data frame; then:

In [None]:
DF.head()

This data frame does not look like the one we see on the website. We need to improve the call:

In [None]:
# requires the packages 'beautifulsoup4' and 'html5lib'
DF=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]
DF.head()

Combining BeautifulSoup (BS) and Pandas gave us the right result. But our work is not over.

Pay attention to the cleaning pandas+BS have done: the 'n/a' was interpreted as **NaN**; no country flags in the data; and the headers are in the right place. 

However, to prepare a final data set, we should pay attention to the header names to avoid _blanks_, and erase the _footnote_ calls.

We can have two strategies:
* Brute-force!

In [None]:
# if we had a small number of names to change, we can use brute-force strategy:
DF.columns=['Country',
 'FreedomintheWorld',
 'IndexofEconomicFreedom',
 'PressFreedomIndex',
 'DemocracyIndex']
DF.head()

* Using more computational thinking (algorithmic):

In [None]:
# if we had many columns, writing an algorithm to rename the columns could be better:

# recalling the data:
DF=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]
DF.columns

I just re-read the data so that I can clean the names in several steps:

1. Find blanks.
2. Find numbers.
3. Find brackets (opening and closing).

These steps require a **regular expression**:

In [None]:
import re  # part of the standard library, no installation needed

# find blanks: \\s+
# find numbers: \\d+
# find opening bracket : \\[
# find closing bracket: \\]

# You can combine using '|' (or):
pattern='\\s+|\\d+|\\[|\\]'
nothing=''

Now, let's see how this works for one case:

In [None]:
testString='Freedom in the World 2018[10]'
re.sub(pattern,nothing,testString)
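A small sketch showing that the pattern can also be written as a raw string, which avoids doubling the backslashes. The header names here are hypothetical (they only imitate the shape of the real ones):

```python
import re

escaped = '\\s+|\\d+|\\[|\\]'   # the pattern as written above
raw = r'\s+|\d+|\[|\]'          # same pattern as a raw string

print(escaped == raw)

# hypothetical header names, similar in shape to the ones in the table
headers = ['Freedom in the World 2018[10]', 'Democracy Index 2017[13]']
cleaned = [re.sub(raw, '', header) for header in headers]
print(cleaned)
```

Both spellings produce exactly the same string, so use whichever you find easier to read.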

Now, let's see how this works for ALL cases:

In [None]:
[re.sub(pattern,nothing,name) for name in DF.columns]

We can verify we are matching well:

In [None]:
newNames=[re.sub(pattern,nothing,name) for name in DF.columns]

# checking:
list(zip(DF.columns,newNames))

Let's turn that match into a dictionary:

In [None]:
{old:new for old,new in zip(DF.columns,newNames)}

Once you have a dict like that one, you can use it to rename the columns with another function:

In [None]:
changes={old:new for old,new in zip(DF.columns,newNames)}

DF.rename(columns=changes,inplace=True)

If you have new names for only some of the columns, and you do not want to rewrite every column name, a dictionary with _rename_ is the way to do it.

Let's see the result:

In [None]:
DF.head()

A next step will be verifying if the answers are well coded:

In [None]:
DF.iloc[:,1::].describe()

What were you looking for?
Sometimes a category is written in more than one way; for instance, if 'Free', 'free' and 'free ' represent the same answer in one column, you have a mistake. Let's see if there is one here:

In [None]:
DF.FreedomintheWorld.value_counts()
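If such a mistake did appear, a possible fix is to normalize the text before counting. This is a sketch in plain Python with hypothetical answers (`answers` is an assumption for this example); pandas offers the same string operations through the `.str` accessor of a column.

```python
from collections import Counter

# hypothetical answers where the same category is written in three ways
answers = ['Free', 'free', 'free ', 'Not free', 'Free']

print(Counter(answers))   # three different spellings of 'Free'

# strip blanks at the ends and unify capitalization before counting
normalized = [answer.strip().capitalize() for answer in answers]
counts = Counter(normalized)
print(counts)
```

After normalizing, only the two real categories remain.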

What we see is that this variable has its own correct set of answers.

We can try that approach for each variable, but we can also check the whole group of categorical values like this:

In [None]:
# DF.iloc[:,1::] all columns but the first one
# apply(set)  apply the function 'set()'  per column (get unique values)
# tolist() convert to a list 

DF.iloc[:,1::].apply(set).tolist()

[Go to page beginning](#beginning)
____
<a id='part3'></a>
### Formatting

The data seems _clean_, but we need now to be sure the information is in the right format. This varies according to the project; so, let me show you some steps of the formatting stage.

1. Verify the data types:


In [None]:
DF.dtypes

All but the first variable are categories, not text (_object_). To convert them into categories you can do this:

In [None]:
headerNames=DF.columns
DF[headerNames[1:]]=DF[headerNames[1:]].astype('category')

When a variable is of categorical type, you can use particular functions for them:

In [None]:
DF.FreedomintheWorld.cat.categories

In [None]:
DF.IndexofEconomicFreedom.cat.categories

In [None]:
DF.PressFreedomIndex.cat.categories

In [None]:
DF.DemocracyIndex.cat.categories

2. If ordinal, make the adjustment.

What differentiates an ordinal categorical from a plain categorical is that its categories have a meaningful order. These variables should be ordinal, but the current (alphabetical) order of their categories does not reflect the order they should have.

We can turn it into an ordinal doing the following:

a. Find a good numeric sequence for the ordinal values:

In [None]:
# notice I am using the numbers in the same order as the list of categorical values:
oldFree=list(DF.FreedomintheWorld.cat.categories)

# '5 very good' / '4 good' / '3 middle' / '2 bad' / '1 very bad'

newFree=[5,1,3]
recodeFree={old:new for old,new in zip (oldFree,newFree)}

oldEco=list(DF.IndexofEconomicFreedom.cat.categories)
newEco=[5,3,4,2,1]
recodeEco={old:new for old,new in zip (oldEco,newEco)}

oldPress=list(DF.PressFreedomIndex.cat.categories)
newPress=[2,5,3,4,1]
recodePress={old:new for old,new in zip (oldPress,newPress)}

oldDemo=list(DF.DemocracyIndex.cat.categories)
newDemo=[1,4,5,2]
recodeDemo={old:new for old,new in zip (oldDemo,newDemo)}

b. Rename the still plain categorical:

In [None]:
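# rename_categories returns a new column (recent pandas versions no longer accept 'inplace' here):
DF['FreedomintheWorld']=DF.FreedomintheWorld.cat.rename_categories(recodeFree)

DF['IndexofEconomicFreedom']=DF.IndexofEconomicFreedom.cat.rename_categories(recodeEco)

DF['PressFreedomIndex']=DF.PressFreedomIndex.cat.rename_categories(recodePress)

DF['DemocracyIndex']=DF.DemocracyIndex.cat.rename_categories(recodeDemo)

# let's see: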
DF.head(10)

c. Now turn the renamed columns into numeric values:

In [None]:
DF[headerNames[1:]]=DF[headerNames[1:]].apply(pd.to_numeric)

Let me verify:

In [None]:
DF.head()
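The steps above recode the categories as numbers. As an alternative (not the approach followed in this session), pandas can store the order in the categorical itself. A minimal sketch with hypothetical answers (`levels` is an assumption for this example):

```python
import pandas as pd

# hypothetical ordinal answers
levels = pd.Series(['Low', 'High', 'Medium', 'Low'])

# declare the categories in their intended order
ordered = pd.Categorical(levels, categories=['Low', 'Medium', 'High'], ordered=True)
levelsOrdinal = pd.Series(ordered)

print(levelsOrdinal.cat.categories.tolist())
print(levelsOrdinal.max())               # comparisons respect the declared order
print(levelsOrdinal.cat.codes.tolist())  # integer codes that follow that order
```

Keep in mind that this ordering lives only in pandas; a CSV file will not carry it, which is one reason to prefer the numeric recoding when the file goes to R.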

3. Deal with the missing data.

The data has some missing data:

In [None]:
DF.info()

Now comes the thinking: How to replace the missing values?

Python can easily find and replace every missing value; but our strategy will be different:

* _Freedom in the World_ has the fewest missing values, so we will use this variable to see how the others behave.

* Since the variables are ordinals (even though they are numbers now) a good candidate to impute a missing is the median NOT the mean (you can not compute the mean of an ordinal).

Let's see:

In [None]:
#median per group: 
DF.groupby('FreedomintheWorld')[headerNames[2:]].median()

We need to use those medians as the replacement whenever a missing value is found:

In [None]:
for col in headerNames[2:]:
    # in each column, get median by FIW group, and use it to replace the missing values.
    DF[col]=DF[col].fillna(DF.groupby("FreedomintheWorld")[col].transform("median"))

In [None]:
DF.head(20)
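To see the imputation logic in isolation, here is a sketch with a tiny hypothetical data frame (`toy`, `group` and `score` are assumptions for this example): the median is computed inside each group and used only where a value is missing.

```python
import pandas as pd
import numpy as np

# hypothetical data: 'group' has no missing values, 'score' has two
toy = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                    'score': [4, 2, np.nan, 5, np.nan, 3]})

# median of 'score' inside each group, repeated for every row of the group
groupMedian = toy.groupby('group')['score'].transform('median')

# replace only the missing values, and assign the result back to the column
toy['score'] = toy['score'].fillna(groupMedian)
print(toy)
```

The key piece is _transform_, which returns one value per row (not one per group), so it lines up with the original column.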

We can send this to R, in a simple CSV format:

In [None]:
#DF.to_csv("indexes.csv",index=None)

______

## More examples

### Case: Democracy Index

Let me clean a similar data set from Wikipedia, about the democracy index:

In [None]:
import pandas as pd
# location:
demoLink = "https://en.wikipedia.org/wiki/Democracy_Index" 

#collection
demodex=pd.read_html(demoLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]

1. Looking for messiness:

In [None]:
# what's on top?
# names? weird symbols? more links?
demodex.head(10)

In [None]:
# what's at the bottom?
# note? credits? extra info?

demodex.tail(10)

First, we see a column that has some messiness (the symbol "=" in _Rank_), but it can be deleted as its information is not relevant. Let me also get rid of the _Score_, as it is just the mean of the other ones. The last row is a repetition of the headers, so that one should go, too:

In [None]:
#bye row 167, and two columns
demodexClean=demodex.drop(index=167,columns=['Rank','Score'])

In [None]:
demodexClean

As there are few names, we can replace them with shorter ones:

In [None]:
newNames=['pluralism','effectiveness','participation','culture','liberties']

# names from the second and before the last one '[1:-1]':
newMapper={old:new for old,new in zip(demodexClean.columns[1:-1],newNames)}

demodexClean.rename(columns=newMapper,inplace=True)

In [None]:
# this is what we have so far:
demodexClean.head()

It looks good so far. Let's go to formatting.

2. Giving the right format:

In [None]:
# checking data types:
demodexClean.dtypes

Above, we see that some indices need to be converted to numeric:

In [None]:
demodexClean[newNames]=demodexClean[newNames].apply(pd.to_numeric)

The last one is a categorical variable:

In [None]:
demodexClean.Category.value_counts()

When you have text, you could get the unique values of a column like this:

In [None]:
pd.unique(demodexClean.Category).tolist()

Then, you can prepare the map to recode the values:

In [None]:
oldValues=pd.unique(demodexClean.Category).tolist()
newValues=[4,3,2,1]
mapNewOld={old:new for old,new in zip(oldValues,newValues)}
mapNewOld

You can do it in this way:

In [None]:
demodexClean.Category=demodexClean.Category.replace(mapNewOld)

In [None]:
# the 'inplace' form on a single column may not update the data frame in recent pandas versions:
# demodexClean.Category.replace(mapNewOld,inplace=True)

You can save it as a category, but that will be lost if sent to R:

In [None]:
demodexClean.Category=demodexClean.Category.astype('category')

In [None]:
demodexClean['Category'].cat.categories

In [None]:
# checking missing values

demodexClean.info()

This data is now ready for R.

In [None]:
# demodexClean.to_csv("democracyIndex.csv",index=None)

### The case of Medicare:

Here I will use data from [Medicare Beneficiary Enrollment and Demographics](https://dev.socrata.com/foundry/data.wa.gov/2cup-2fnu)

In [None]:
import requests

# This time I am talking to the API from DATA.WA.GOV
url = "https://data.wa.gov/resource/2cup-2fnu.json?year=2014"
response = requests.get(url)
response.raise_for_status()  # stops with an error if the request failed
medicare = response.json()

In [None]:
# turning json into DF:
medicare2014 = pd.DataFrame(medicare)

In [None]:
medicare2014.head()

In [None]:
medicare2014.tail()

The first row is the total, it has to go:

In [None]:
#this one?
medicare2014.drop(index=0).head()

In [None]:
#or this one?
medicare2014.drop(index=0).reset_index().head()

In [None]:
#or this one?
medicare2014.drop(index=0).reset_index(drop=True).head()

When we use _inplace_, we should not chain the calls, as each one returns nothing (_None_):

In [None]:
medicare2014.drop(index=0,inplace=True)
medicare2014.reset_index(drop=True,inplace=True)
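A quick sketch of why chaining and _inplace_ do not mix, using a hypothetical data frame (`toy` and its rows are assumptions for this example):

```python
import pandas as pd

toy = pd.DataFrame({'county': ['Total', 'Adams', 'Asotin']})  # hypothetical rows

# without inplace, each method returns a new data frame, so calls can be chained
chained = toy.drop(index=0).reset_index(drop=True)
print(chained)

# with inplace=True the method returns None, so there is nothing to chain on
result = toy.drop(index=0, inplace=True)
print(result)
```

Either style works; just do not combine them in the same line.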

The result so far:

In [None]:
medicare2014.head()

In [None]:
# what we have
medicare2014.dtypes

Notice that the three variables before the last one, and county, should be kept as objects, while the others should be numeric:

In [None]:
# get original order:
original=medicare2014.columns.tolist()
original

In [None]:
# new order:  (no need for * if one element)
newOrder=[original[3],*original[14:], *original[0:3],*original[4:14],] # using '*'
newOrder
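The role of the '*' can be sketched with a hypothetical list of column names (`columns` is an assumption for this example):

```python
# hypothetical column names
columns = ['a', 'b', 'c', 'd', 'e']

# without '*', the slice is inserted as ONE element (a nested list)
nested = [columns[3], columns[:3]]
print(nested)

# with '*', the elements of the slice are unpacked into the new list
flat = [columns[3], *columns[4:], *columns[:3]]
print(flat)
```

A single element like `columns[3]` is not a list, so it needs no '*'.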

In [None]:
# moving columns:

medicare2014=medicare2014[newOrder]
medicare2014.head()

2. Formatting

We can give the right format now:

In [None]:
headerNames=medicare2014.columns
medicare2014[headerNames[4::]]=medicare2014[headerNames[4::]].apply(pd.to_numeric)

In [None]:
#check data types:
medicare2014.dtypes

We can explore the variables:

In [None]:
medicare2014.describe(include='all') # to include categorical

In [None]:
medicare2014.info()

There are some missing values, but we will leave them as they are. So the last step will be just to save the file:

In [None]:
# medicare2014.to_csv("medicare2014.csv",index=None)

### Case: Public education:

When you visit the [website](https://nces.ed.gov/ccd/) of the Common Core of Data from the US Department of Education, you can get a data set with detailed information on public schools in the state of Washington:

In [None]:
dataFile='https://github.com/EvansDataScience/data/raw/master/wapubs.xlsx'
schoolPub=pd.read_excel(dataFile) 

1. Looking for messiness:

In [None]:
schoolPub.head(20)

The first row is not the beginning of the table. We need to skip 11 rows; but pay attention to what you are skipping, as those rows tell you how missing values were coded.

In [None]:
schoolPub=pd.read_excel(dataFile,skiprows=11,na_values=['†','‡','–'])
schoolPub.head()

In [None]:
#checking the tail:
schoolPub.tail()

The headers have blanks and symbols; let's get rid of them here:

In [None]:
import re  

pattern='\\*|\\s+'
nothing=''
schoolPub.columns=[re.sub(pattern,nothing,columnName) for columnName in schoolPub.columns]
schoolPub.columns

Clean names allow better exploring. Notice we solved the missing values above. You could have done this instead:

In [None]:
#symbolsForNA=['†','‡','–'] 

#import numpy as np  #numpy manages the nan for pandas
#schoolPub.replace(symbolsForNA,np.nan,inplace=True) # in the whole data frame!!

2. Formatting

In [None]:
schoolPub.dtypes

Even though we cleaned the missing values, there might be more in the text columns that may be hidden. Obviously, 'SchoolName','District','CountyName','StreetAddress','City','State' are text, but the others are possibly categorical.

So let me explore all the other ones, which are of type _object_:

In [None]:
notUsed=['SchoolName','District','CountyName','StreetAddress','City','State']
 
# These are the ones without the obvious text columns
schoolPub.drop(notUsed,axis=1).head()

In [None]:
# These are the ones without the obvious text columns, but of the type 'object':
schoolPub.drop(notUsed,axis=1).select_dtypes(include='object').head()

We need to see the categories there:

In [None]:
schoolPub.drop(notUsed,axis=1).select_dtypes(include='object').apply(set).tolist()

We need to take care of the missing value '**N**':

In [None]:
schoolPub.Locale.value_counts(dropna=False)

Then:

In [None]:
import numpy as np  #numpy manages the nan for pandas

schoolPub.replace(['N'],np.nan,inplace=True) # in the whole data frame!!

In [None]:
# So:
schoolPub.Locale.value_counts(dropna=False)

Another important step could be to add some text to give the school grades a recognizable ordering (considering the file will be read in R):

In [None]:
# this is wrong:
'PK'<'KG'<'01'

In [None]:
# this is OK:
'-1 PK'<'0 KG'<'01'
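The comparisons above can be seen in action by sorting. A sketch with hypothetical grade labels (`grades` is an assumption for this example):

```python
# hypothetical grade labels
grades = ['02', 'PK', '01', 'KG']

# text is compared character by character, so digits come before letters
print(sorted(grades))

# prefixing the labels gives the intended order
recode = {'PK': '-1 PK', 'KG': '0 KG'}
recoded = [recode.get(grade, grade) for grade in grades]
print(sorted(recoded))
```

The prefixes work because '-' sorts before '0', and the blank in '0 KG' sorts before the '1' in '01'.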

In [None]:
# using replace:

schoolPub.replace({'PK':"-1 PK", "KG":"0 KG"},inplace=True)

Unless you want to recode other [variables](https://nces.ed.gov/programs/edge/docs/LOCALE_CLASSIFICATIONS.pdf), we could save this file:

In [None]:
# schoolPub.to_csv("schoolPub.csv",index=None)

### Case: SNAP

In [None]:
import pandas as pd
dataFile="https://github.com/EvansDataScience/data/raw/master/cntysnap.xls"
snapBen=pd.read_excel(dataFile)

In [None]:
# first rows:
snapBen.head()

We need to skip some rows:

In [None]:
# skipping:

snapBen=pd.read_excel(dataFile,skiprows=2)
snapBen.head()

In [None]:
# check the tail
snapBen.tail()

In [None]:
# checking names:
snapBen.columns

In [None]:
# getting rid of blanks:

pattern='\\s+'
nothing=''
snapBen.columns=[re.sub(pattern,nothing,columnName) for columnName in snapBen.columns]

There is a zero FIPS code, take a look:

In [None]:
snapBen[snapBen['CountyFIPScode']==0]

Those are rows about States. I will keep the counties:

In [None]:
snapBenUSCounties=snapBen[snapBen['CountyFIPScode']!=0]

In [None]:
# checking data types:
snapBenUSCounties.dtypes

The counties tell you to what State they belong, so we could use that to create a new column. Let's see a simple example on how to get information from a text:

In [None]:
# using split,a function for strings:
'Autauga County, AL'.split(', ') # notice the space after the comma
# you get a list:

The **split**, in this case, returns the state in the second position of the list (index=1), then:

In [None]:
# saving every second element for each element in the column:
states=[element.split(', ')[1] for element in snapBenUSCounties.Name]

# make that list a new column
snapBenUSCounties=snapBenUSCounties.assign(StateName=states)

# checking:
snapBenUSCounties

The new column was created. We could get rid of the state information from the counties column:

In [None]:
# just keep county names
counties=[element.split(', ')[0] for element in snapBenUSCounties.Name]
snapBenUSCounties=snapBenUSCounties.assign(Name=counties)
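Both pieces can also be obtained from a single split. This is a sketch with hypothetical values (`places` is an assumption), and it assumes every value has exactly one ', ' separator; otherwise the unpacking fails.

```python
# hypothetical values with the same 'County, ST' shape
places = ['Autauga County, AL', 'King County, WA']

# split each value once and unpack both pieces at the same time
pairs = [place.split(', ') for place in places]
counties = [county for county, state in pairs]
states = [state for county, state in pairs]

print(counties)
print(states)
```

If some names could contain more commas, it would be safer to check the length of each split first.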

In [None]:
# quick look:

snapBenUSCounties.head() # the new column will be at the end...

We can have a better column order:

In [None]:
oldNames=snapBenUSCounties.columns.tolist()
oldNames

In [None]:
newNames=[*oldNames[:2],oldNames[-1],*oldNames[2:-1]]
newNames          

In [None]:
# reordering columns:

snapBenUSCounties=snapBenUSCounties[newNames]
snapBenUSCounties.head()

In [None]:
# JUST SAVING...
#snapBenUSCounties.to_csv("snapBenUSCounties.csv",index=None)

### Case: Multiple data sets

In [None]:
corruptLink='https://raw.githubusercontent.com/EvansDataScience/data/master/corruption.csv'
econoLink='https://raw.githubusercontent.com/EvansDataScience/data/master/economic.csv'
enviroLink='https://raw.githubusercontent.com/EvansDataScience/data/master/environment.csv'
pressLink='https://raw.githubusercontent.com/EvansDataScience/data/master/pressfreedom.csv'

* The _corruptLink_ has data about the _Corruption Perception Index_ (CPI) produced by [Transparency International](https://www.transparency.org/).

* The _econoLink_ has data about the _Economic Freedom Index_ (EFI) produced by [Fraser Institute](https://www.fraserinstitute.org).

* The _enviroLink_ has data about the _Environment Performance Index_ (EPI) produced by [Yale University and Columbia University in collaboration with the World Economic Forum](https://epi.envirocenter.yale.edu/).

* The _pressLink_ has data about the _World Press Freedom Index_ (WPFI) produced by [Reporters Without Borders](https://rsf.org/en/world-press-freedom-index).


In this case, I want to join them (not concatenate):

In [None]:
import pandas as pd
corrupt=pd.read_csv(corruptLink,encoding='Latin-1')
econo=pd.read_csv(econoLink,encoding='Latin-1')
enviro=pd.read_csv(enviroLink,encoding='Latin-1')
press=pd.read_csv(pressLink,encoding='Latin-1')

As each data set has a different number of rows (countries), and possibly a different way to name each one, the result will be far from perfect:

In [None]:
join1=pd.merge(corrupt,econo)
join2=pd.merge(press,enviro)
indexes=pd.merge(join1,join2)
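To see what is lost in a merge like this, here is a sketch with two hypothetical data sets (`left`, `right` and their values are assumptions for this example). By default, _merge_ does an inner join on the columns both data frames share; the `indicator` option helps to find the rows without a match.

```python
import pandas as pd

# hypothetical data sets sharing the column 'Country'
left = pd.DataFrame({'Country': ['Peru', 'Chile', 'Bolivia'], 'x': [1, 2, 3]})
right = pd.DataFrame({'Country': ['Peru', 'Chile', 'Brazil'], 'y': [10, 20, 30]})

# default: inner join on the common column(s); unmatched rows are dropped
inner = pd.merge(left, right)
print(inner)

# an outer join with indicator=True shows where each row came from
outer = pd.merge(left, right, how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']
print(unmatched)
```

Looking at the unmatched rows is a good way to detect countries that are simply spelled differently in each file.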

As always it is good to verify the data types:

In [None]:
indexes.dtypes

And check descriptives:

In [None]:
indexes.describe(include='all') 

In [None]:
indexes.head()

There is some formatting needed:

Let's order it:

In [None]:
oldCols=indexes.columns.tolist()
oldCols

When the columns we want are not contiguous, slices do not help and there is extra work:

In [None]:
numericIndex=[oldCols[i] for i in [1,3,4,6]]
numericIndex

In [None]:
newValues=[oldCols[0],oldCols[2],*numericIndex,oldCols[5],oldCols[7]]
newValues

Then, the new order will be:

In [None]:
indexes=indexes[newValues]
indexes.head()

There are several numeric values. Let's see a summary:

In [None]:
indexes.describe()

It is important to check whether these values move in the same direction (monotonicity):

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(indexes.iloc[:,2:6])
plt.show()

The press score (_scorepress_) is negatively correlated with the rest. That means that the score for that column needs to be reversed:

In [None]:
# creating reversing function:
def reverse(aColumn):
    return aColumn.max() - aColumn + aColumn.min()

In [None]:
# reversing using function:
indexes.scorepress=reverse(indexes.scorepress)
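A quick check of what the reversal does, with hypothetical scores (`scores` is an assumption for this example). This variant uses the _max()_ and _min()_ methods of the column, which skip missing values:

```python
import pandas as pd

def reverse(aColumn):
    # highest value becomes the lowest and vice versa; the range is kept
    return aColumn.max() - aColumn + aColumn.min()

scores = pd.Series([10, 20, 40])   # hypothetical scores
reversedScores = reverse(scores)
print(reversedScores.tolist())
```

The minimum and maximum stay the same; only the direction changes.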

We should see a different result:

In [None]:
pd.plotting.scatter_matrix(indexes.iloc[:,2:6])
plt.show()

The variable _presscat_ needs to be an ordinal factor.

In [None]:
indexes['presscat'].value_counts()

In [None]:
indexes['presscat']=indexes['presscat'].replace({'Medium':2, "High":3, "Low":1})

In [None]:
indexes['presscat'].value_counts(sort=False)

The numbers will help R users when they set it as an ordinal. You can convert them to ordinal, but that information will be lost in R.

In [None]:
indexes.head()

We are proposing that the categories coded as numbers follow an ascending order; then let's check if _environmentCat_ should be changed:

In [None]:
indexes['environmentCat'].value_counts()

As there is no need for that, just save the file:

In [None]:
# indexes.to_csv("indexes.csv",index=None)


____

* [Go to page beginning](#beginning)
* [Go to REPO in Github](https://github.com/eScienceWinterSchool/PythonSession)
* [Go to WinterSchool Repos](https://github.com/eScienceWinterSchool)