<a href="https://colab.research.google.com/github/dgullate/Curso-IE/blob/master/Week_1/pandas_completo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas
![alt text](https://pandas.pydata.org/_static/pandas_logo.png)

Pandas is a Python library that provides high-level, easy-to-use data structures for data analysis.
The main object is the `DataFrame`, a two-dimensional data structure similar to a table (and to data frames in R).

In [None]:
!git clone https://github.com/dgullate/Curso-IE.git

In [None]:
import pandas as pd 
import numpy as np

## Creating dataframes

We can create a dataframe from several `Series` objects, each of which becomes a row in our table. For instance, let's create a table with 3 rows whose columns are `Cost`, `Item Purchased` and `Name`:

In [None]:

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])

As we see, it is enough to define a `Series` object for each row of the table. The Series should share the same keys and use a consistent data type for each key, so that every column of the table ends up with a well-defined type. The `index` argument specifies a label for each row. By default, the index consists of integers (0, 1, 2, ...), but in this case we used strings. Note that index labels need not be unique: here two rows share the label `'Store 1'`.
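
As a small sketch of the role of `index` (the two shortened purchases below are toy data, not the table above), we can compare the default integer index with explicit string labels:

```python
import pandas as pd

# Two toy purchases with only two fields each, for brevity.
rows = [pd.Series({'Name': 'Chris', 'Cost': 22.50}),
        pd.Series({'Name': 'Kevyn', 'Cost': 2.50})]

default_df = pd.DataFrame(rows)                                 # no index given
labelled_df = pd.DataFrame(rows, index=['Store 1', 'Store 2'])  # explicit labels

print(default_df.index.tolist())
print(labelled_df.index.tolist())
```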

In [None]:
df.head()

We can select the rows that have a given index label using `loc`.

In [None]:
df.loc['Store 1']

If we also want to select a specific column, we can pass its name after a comma.

In [None]:
df.loc['Store 1', 'Cost']

Even better, both arguments can be lists.

In [None]:
df.loc[['Store 1', 'Store 2'], ['Cost','Name']]

If, instead of accessing rows via the index labels, we would like to access them by row or column position, we can do so using `iloc`. As an example, the following command selects the first 3 rows (positions 0, 1 and 2, since the slice `:3` stops before position 3) and all columns.


In [None]:
df.iloc[:3, :]
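
To make the difference between `loc` (labels) and `iloc` (positions) concrete, here is a minimal sketch on a toy table whose values mirror the example above; the variable names are ours:

```python
import pandas as pd

toy = pd.DataFrame({'Cost': [22.5, 2.5, 5.0],
                    'Name': ['Chris', 'Kevyn', 'Vinod']},
                   index=['Store 1', 'Store 1', 'Store 2'])

by_label = toy.loc['Store 1']   # every row labelled 'Store 1' (two rows here)
by_position = toy.iloc[:2, :]   # rows at positions 0 and 1, all columns
first_cell = toy.iloc[0, 0]     # row 0, column 0, i.e. the first 'Cost'

print(by_label)
print(by_position)
print(first_cell)
```

In this toy table both selections return the same two rows, but only because the first two positions happen to carry the label `'Store 1'`.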

<br>
If we want to delete the rows with a given index label, we can do so using `drop`:

In [None]:
df.drop('Store 1')

In [None]:
df

What happened? Didn't we delete it? By default, `drop` returns a new dataframe and leaves the original untouched. We can either store the result in a variable, as in `df2 = df.drop(...)`, or ask pandas to modify the existing dataframe by passing the flag `inplace=True`.

In [None]:
df.drop('Store 1', inplace=True)
df
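
The behaviour of `drop` can be checked on a toy table (made-up values); this sketch uses reassignment, which is equivalent here to `inplace=True`:

```python
import pandas as pd

toy = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]},
                   index=['Store 1', 'Store 1', 'Store 2'])

smaller = toy.drop('Store 1')   # new dataframe without the 'Store 1' rows
rows_before = len(toy)          # the original still has all its rows

toy = toy.drop('Store 1')       # reassigning replaces the original
rows_after = len(toy)

print(rows_before, len(smaller), rows_after)
```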

## Loading dataframes

Most of the time we will not build dataframes by hand like this, since the data will be stored in a file on a hard disk or a web server. Pandas has functions that read the contents of such files and create a dataframe object from them.
All of these functions are named `pd.read_...`, and there are versions for Excel, HTML, JSON and other file types. We will use the most common one, `pd.read_csv`, which reads .csv files (each row of the dataframe is a line in the file, and values are usually separated by commas).

Next, we load the data contained in the file `tips.csv` into a `pandas` dataframe. This dataset contains information on the cost of meals and tips in restaurants. It has the following variables:

* `total_bill`: total price of the meal.
* `tip`: amount given as tip.
* `sex`: Gender of the paying person (Female/Male).
* `smoker`: Categorical variable indicating whether the payer is a smoker (yes/no).
* `day`: Day of the week.
* `time`: Categorical variable indicating if it was lunch or dinner.
* `size`: Number of people at the table

In [None]:
#dir='/content/Curso-IE/Week_1/data/tips.csv'
dir='data/tips.csv'
df = pd.read_csv(dir)

Besides the file name, which is required, `read_csv` has many optional arguments. The most common are:

* `nrows`: reads only the given number of rows (useful for testing before we load a very large file).
* `usecols`: the dataframe will only contain these columns.
* `dtype`: specifies the data type of each column (by default, pandas tries to infer the type automatically by looking at the data).
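
A minimal sketch of these three arguments follows. To keep it self-contained, it reads from an in-memory string through `io.StringIO` instead of a file; the few rows and the choice of `int8` are only illustrative assumptions:

```python
import io
import pandas as pd

# A few illustrative rows in CSV form (stand-in for a real file on disk).
csv_text = (
    "total_bill,tip,sex,size\n"
    "16.99,1.01,Female,2\n"
    "10.34,1.66,Male,3\n"
    "21.01,3.50,Male,3\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    nrows=2,                          # read only the first two data rows
    usecols=['total_bill', 'size'],   # keep only these columns
    dtype={'size': 'int8'},           # force a compact integer type
)

print(sample)
print(sample.dtypes)
```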

## Exploratory analysis

Once we have loaded the data, we can inspect it with the following four functions: `head` and `tail` to see the first and last rows, `describe` to obtain basic summary statistics, and `info` to obtain the data type of each column.

In [None]:
df.head(10) # in parentheses, the number of rows to display
            # (5 by default)

In [None]:
df.tail(3)

In [None]:
df.describe()

For columns with numerical data, `describe` reports the number of non-missing values (`count`), the mean, the standard deviation, the minimum and maximum, and the quartiles.
If the table also contains columns with categorical data, `describe` by default only summarizes the numerical ones. If we want information on the other columns as well, we must write:

In [None]:
df.describe(include='all')

For categorical variables, we also see the number of unique values, the most frequent value and its frequency. For instance, the `sex` column has 244 entries and only 2 unique values; the most frequent value is Male (157 out of 244), so the other value, Female, accounts for the remaining 87.
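
The rows of the summary can also be read programmatically. The following sketch uses a toy table (made-up values with one missing tip) to show that `count` ignores missing values and how `unique`, `top` and `freq` describe a categorical column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'tip': [1.0, 2.0, np.nan, 3.0],
                    'sex': ['Male', 'Female', 'Male', 'Male']})

summary = toy.describe(include='all')

tip_count = summary.loc['count', 'tip']    # non-missing tips only
sex_unique = summary.loc['unique', 'sex']  # number of distinct values
sex_top = summary.loc['top', 'sex']        # most frequent value
sex_freq = summary.loc['freq', 'sex']      # how often it occurs

print(tip_count, sex_unique, sex_top, sex_freq)
```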

In [None]:
df.info() 

## Querying dataframes

So far we have seen how to access certain items and/or certain columns in our dataframe. Let's see now other ways to make more dynamic queries.

A quick way to access a whole column is:

In [None]:
df['tip'].head()

We could also do

In [None]:
df.tip.head()
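
One caveat about the attribute syntax, shown here as a short sketch on toy data: it only works when the column name does not clash with an existing dataframe attribute. For a column called `size`, `toy.size` returns the number of cells in the table, so the bracket syntax is the safe choice.

```python
import pandas as pd

toy = pd.DataFrame({'tip': [1.0, 2.0], 'size': [2, 3]})

same_column = toy.tip.equals(toy['tip'])   # attribute access works for 'tip'
n_cells = toy.size                         # NOT the column: rows * columns
size_column = toy['size']                  # brackets always return the column

print(same_column, n_cells, size_column.tolist())
```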

As pandas is built on numpy, we can use a very similar syntax.

In [None]:
df['size'].median(), df['size'].mean()

In [None]:
df['size'] > df['size'].median()

The previous command has generated a Boolean array, so we can use it as a mask to select rows in our dataframe !

In [None]:
large_tables = df[ df['size'] > df['size'].median() ]
large_tables.head()

We selected those tables that are larger than the median value, but we see they keep the original index. If we want to reset the index we must do

In [None]:
large_tables.reset_index(inplace=True, drop=True)
large_tables.head()

The criteria to filter and select rows can be as complex as we wish, using logic operators.

In [None]:
df[ (df['size'] > df['size'].median()) & (df['sex'] == 'Female') ].head()
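
As a sketch of how masks combine (toy data, made-up values): `&` means "and", `|` means "or" and `~` means "not". Each comparison needs its own parentheses, because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

toy = pd.DataFrame({'size': [2, 3, 4, 2, 6],
                    'sex': ['Female', 'Male', 'Female', 'Male', 'Female']})

is_large = toy['size'] > toy['size'].median()   # the median is 3 here
is_female = toy['sex'] == 'Female'

large_and_female = toy[is_large & is_female]          # both conditions hold
large_or_male = toy[is_large | (toy['sex'] == 'Male')]  # at least one holds
not_large = toy[~is_large]                            # negation of the mask

print(len(large_and_female), len(large_or_male), len(not_large))
```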


##    Exercise 1
<img src="https://www.shareicon.net/data/256x256/2016/06/09/778169_game_512x512.png" alt="Exercise" width="50"/> 


Load the table `olympics.csv`. If you look at the file, you will see that the first row must be skipped (use `skiprows`). Moreover, we can tell pandas to use the first column as the index (`index_col=0`). Let's do some basic exploratory analysis of the data.

In [None]:
#dir='/content/Curso-IE/Week_1/data/olympics.csv'
dir='data/olympics.csv'
olympics = pd.read_csv(dir, skiprows=1, index_col=0)

To help us, run the following code to rename the columns.

In [None]:
for col in olympics.columns:
    if col[:2]=='01':
        olympics.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        olympics.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        olympics.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        olympics.rename(columns={col:'#' + col[1:]}, inplace=True) 
        
names_ids = olympics.index.str.split(r'\s\(') # split the index at the whitespace followed by '(' (raw string avoids an invalid-escape warning)

olympics.index = names_ids.str[0] # the [0] element is the country name (new index) 
olympics['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
        
olympics =  olympics.drop('Totals')
olympics.head()

In [None]:
olympics.describe()

**1.1** Which country obtained the highest number of gold medals in the Summer Olympic Games?

In [None]:
#your answer here

**1.2** List the countries that have obtained more medals (each medal counts the same) in the winter than in the summer games.

In [None]:
## your answer here

## Creating new columns and applying functions on columns

Adding a new column to our table is as simple as assigning a value to a new column name.

In [None]:
df['new_col'] = None
df.head()

We can also create columns using numpy functions. For instance, we will fill the new column with random numbers sampled from a normal distribution.

In [None]:
df['new_col'] = np.random.randn(len(df))
df.head()

We can also combine other columns of the dataframe.

In [None]:
del df['new_col']
df['total_bill_rand'] = np.random.randn(len(df)) + df['total_bill']
df.head()
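
As a further sketch of deriving columns from existing ones (toy values; the column names `tip_pct` and `bill_with_tip` are hypothetical), we can either assign directly or use `assign`, which returns a new dataframe:

```python
import pandas as pd

toy = pd.DataFrame({'total_bill': [20.0, 50.0], 'tip': [3.0, 5.0]})

# Direct assignment adds the column to the existing dataframe.
toy['tip_pct'] = 100 * toy['tip'] / toy['total_bill']

# assign() leaves `toy` untouched and returns a copy with the extra column.
extended = toy.assign(bill_with_tip=toy['total_bill'] + toy['tip'])

print(extended)
```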

A very important method in pandas is `apply`, which lets us apply any function (from numpy, pandas or defined by us) to a given column or list of columns.

For instance, we can calculate the square root of the columns `total_bill`, `tip` and `total_bill_rand`:

In [None]:
df[['total_bill', 'tip', 'total_bill_rand']].apply(np.sqrt).head()

In [None]:
df.apply(np.max)
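
The two calls above behave differently, which the following sketch on toy data tries to make explicit: an element-wise function such as `np.sqrt` returns a table of the same shape, a reducing function such as `np.max` returns one value per column, and `axis=1` applies the function to each row instead.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'total_bill': [16.0, 25.0], 'tip': [1.0, 4.0]})

roots = toy.apply(np.sqrt)      # same shape as toy
col_max = toy.apply(np.max)     # one value per column
row_sum = toy.apply(lambda row: row['total_bill'] + row['tip'], axis=1)  # one value per row

print(roots)
print(col_max)
print(row_sum)
```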

Let's see another example with a user-defined function:

In [None]:
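def small_filter(x, y):
    # Work on a copy so that the column passed in is not modified in place
    x = x.copy()
    x[x < y] = 0  # set every value below the threshold y to 0
    return x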

In [None]:
df[['total_bill']].apply(small_filter, args=[df['total_bill'].mean()]).head()

Another way to do it is to use `where`: where the condition is True the value is kept, and where it is False the value is replaced by the second argument.

In [None]:
mask=df['total_bill']>df['total_bill'].mean()
df['total_bill'].where(mask,0).head()
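
A short sketch on a toy Series shows `where` next to its counterpart `mask`, which does the opposite (it replaces the values where the condition is True):

```python
import pandas as pd

bills = pd.Series([10.0, 30.0, 20.0, 40.0])
above_mean = bills > bills.mean()      # the mean is 25.0

kept = bills.where(above_mean, 0)      # keep values above the mean, others become 0
replaced = bills.mask(above_mean, 0)   # replace values above the mean with 0

print(kept.tolist())
print(replaced.tolist())
```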

## Exercise 2
<img src="https://www.shareicon.net/data/256x256/2016/06/09/778169_game_512x512.png" alt="Exercise" width="50"/> 

We work on the olympics dataset. Get the countries that have obtained more points in winter than summer games. Points are calculated in the following way: bronze 1, silver 2, gold 3.

In [None]:
# Create 2 new columns points_summer and points_winter
# Your answer here

In [None]:
# Query your dataframe to find out which countries have less points_summer 
# than points_winter

#Your answer here

**2.1** Which country has the smallest relative difference between summer and winter points? For every country, compute
$$
dif\_rel = \frac{|points\_summer - points\_winter|}{points\_summer + points\_winter} 
$$

In [None]:
# Create a new column with the result of the formula above

#Your answer here

In [None]:
#find the country that has the minimum value of that new column

#Your answer here

## Merging dataframes

It often happens that relevant information is spread across different tables. For this purpose, pandas has the `merge` operation, which brings together rows coming from different tables (much like a join in SQL). Let us start with some example data.

In [None]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())

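`merge` takes two tables, the argument `how` (which defaults to `'inner'`), and the key used to match rows in each table: either columns, via `on` or `left_on`/`right_on`, or the index, via `left_index=True`/`right_index=True`. If no key is specified, pandas uses the columns that have the same name in both tables.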

Let us see the different options that the `how` argument can take.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/joinimages.png"  width="400"/> 

With `outer`, the combined table has all the rows from the left table (`staff_df`) and the right table (`student_df`), even if there is no match. For instance, James is a Grader in the left table and belongs to the Business school in the right table, so both values appear in the combined table. Kelly, however, only appears in the staff table, so her `School` value is missing (NaN) in the combined table:

In [None]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

In [None]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

`left` only takes the rows that appear in the left table, and tries to bring in information on the other columns from the right table (whenever possible):

In [None]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

`right` works like `left`, but keeps the rows that appear in the right table:

In [None]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
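
To compare the four options side by side, this self-contained sketch rebuilds the two example tables and counts the rows that each value of `how` produces (the dictionary `row_counts` is just our bookkeeping):

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                      'Role': ['Director of HR', 'Course liasion', 'Grader']}).set_index('Name')
students = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                         'School': ['Business', 'Law', 'Engineering']}).set_index('Name')

row_counts = {}
for how in ['outer', 'inner', 'left', 'right']:
    merged = pd.merge(staff, students, how=how, left_index=True, right_index=True)
    row_counts[how] = len(merged)

print(row_counts)

# Kelly is staff only, so her School is missing in the outer merge.
outer = pd.merge(staff, students, how='outer', left_index=True, right_index=True)
kelly_school_missing = pd.isna(outer.loc['Kelly', 'School'])
```

Only Sally and James appear in both tables, so `inner` keeps two rows, while `outer` keeps all four distinct names.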

Now we will reset the indices and perform a `merge`, specifying which column we want to use as the key in each table:

In [None]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

One small thing: if, other than the key column, the two tables have other columns with the same name, the merged table renames them in order to avoid collisions (by default, pandas appends the suffixes `_x` and `_y`).

In [None]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])

print(staff_df.head())
print()
print(student_df.head())

In [None]:
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
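
The default suffixes can be replaced through the `suffixes` argument of `merge`. A minimal sketch on reduced versions of the two tables (the suffix names `_staff` and `_student` are our choice):

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Kelly', 'Sally'],
                      'Location': ['State Street', 'Washington Avenue']})
students = pd.DataFrame({'Name': ['Sally', 'Mike'],
                         'Location': ['512 Wilson Crescent', 'Fraternity House #22']})

default_names = pd.merge(staff, students, how='left', on='Name')
custom_names = pd.merge(staff, students, how='left', on='Name',
                        suffixes=('_staff', '_student'))

print(default_names.columns.tolist())
print(custom_names.columns.tolist())
```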

We can also merge using more than one column as keys for matching:

In [None]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
print(staff_df)
print()
print(student_df)


In [None]:
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])
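
When the key columns have the same names in both tables, as here, `on` is a shorter equivalent of passing the same list to `left_on` and `right_on`. A sketch on reduced versions of the tables:

```python
import pandas as pd

staff = pd.DataFrame({'First Name': ['Sally', 'James'],
                      'Last Name': ['Brooks', 'Wilde'],
                      'Role': ['Course liasion', 'Grader']})
students = pd.DataFrame({'First Name': ['James', 'Sally'],
                         'Last Name': ['Hammond', 'Brooks'],
                         'School': ['Business', 'Engineering']})

# Both key columns must match, so James Wilde and James Hammond are not paired.
matched = pd.merge(staff, students, how='inner', on=['First Name', 'Last Name'])

print(matched)
```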


## Exercise 3
<img src="https://www.shareicon.net/data/256x256/2016/06/09/778169_game_512x512.png" alt="Exercise" width="50"/> 

Let's go back to the Olympics. Load the table  `population.csv` that provides the population of each country across a number of years. To make things simple, we will only keep the populations in year 2016 (you need to select them).
Once you have done that, combine the tables olympics and population (such that there cannot be any missing values). Make sure in the combined table we have at least the following 3 columns: name of the country, number of gold medals and population.

**3.1** Which country has the highest ratio of gold medals per capita?


In [None]:
# Your answer here

**3.2** Now do the same thing as before, but rank the 10 countries that have the largest ratio of gold medals to GDP. Can you guess beforehand which one will come first?

In [None]:
# Your answer here

## Grouping values in dataframes
A very important function in pandas is `groupby`, which we can use for the following purposes:

* split the table into groups according to some criteria
* apply some operation to each group independently
* combine the results into a new table with aggregated data
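
The three steps above can be sketched on a toy table (made-up values): first by hand, looping over the groups, and then in a single `groupby` expression that gives the same numbers.

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['Female', 'Male', 'Female', 'Male'],
                    'total_bill': [10.0, 20.0, 30.0, 40.0]})

# Split and apply by hand: one mean per group, collected in a dict.
manual = {}
for name, group in toy.groupby('sex'):
    manual[name] = group['total_bill'].mean()

# Split, apply and combine in one step: the result is a Series indexed by group.
combined = toy.groupby('sex')['total_bill'].mean()

print(manual)
print(combined)
```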

Let us get started with creating groups. The typical procedure is to split according to the values of some categorical variables (gender, day of the week, etc.)

Let us do some examples on the table `tips.csv`

In [None]:
for name, group in df.groupby(by=['sex']):
    print(name)
    print(group)

We can group by more than one variable. For instance, if we group by sex and size, there are 2 values for sex and 6 values for size, so up to 12 groups are created (one for each combination that actually occurs in the data).

In [None]:
for name, group in df.groupby(by=['sex', 'size']):
    print(name)
    print(group)


Once we have created the groups, we can apply functions over them. For instance, we can calculate the average price of the bill for each of the groups.

In [None]:
for name, group in df.groupby(by=['sex', 'size']):
    print(name, group['total_bill'].mean())
    # equivalent to np.mean(group['total_bill'])

The previous method is not so handy if we want to incorporate this information into the actual dataframe, so pandas has for this purpose the function `agg`:

In [None]:
df_bill = df.groupby(['sex']).agg({'total_bill': 'mean'})  # the string 'mean' avoids the FutureWarning that recent pandas raises for np.mean
df_bill

In [None]:
df_bill = df.groupby(['sex', 'size']).agg({'total_bill': 'mean'})  # the string 'mean' avoids the FutureWarning that recent pandas raises for np.mean
df_bill
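
As an alternative sketch, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, which lets us name the output columns directly. The names `mean_bill` and `n_meals` and the toy values are ours:

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['Female', 'Male', 'Female', 'Male'],
                    'total_bill': [10.0, 20.0, 30.0, 40.0]})

summary = toy.groupby('sex').agg(
    mean_bill=('total_bill', 'mean'),   # average bill per group
    n_meals=('total_bill', 'count'),    # number of meals per group
)

print(summary)
```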

Using `pivot`, we can pivot this table in order to see the results better.

In [None]:
df_bill.reset_index().pivot(index='sex', columns='size')
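
To see what `pivot` does in isolation, here is a sketch on a small long-format table (made-up values). Passing `values` explicitly yields plain column labels; without it, as in the cell above, the columns form a two-level index.

```python
import pandas as pd

long_table = pd.DataFrame({'sex': ['Female', 'Female', 'Male', 'Male'],
                           'size': [2, 4, 2, 4],
                           'total_bill': [15.0, 28.0, 17.0, 30.0]})

# One row per sex, one column per table size, cells filled with total_bill.
wide_table = long_table.pivot(index='sex', columns='size', values='total_bill')

print(wide_table)
```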

Do you see anything fishy? In the Female row, the average `total_bill` is smaller for tables of 6 people than for tables of 4. How can that be?

In [None]:
df[(df['sex']=='Female')&(df['size']==6)]

In [None]:
df[(df['sex']=='Female')&(df['size']==4)]

A possible explanation is that the table of 6 has only 2 cases, which happen to be cheaper meals.
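
A quick way to spot this kind of issue is to report the group size next to the mean. The sketch below uses made-up values in which a single cheap table of 6 produces a lower average than the tables of 4:

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['Female'] * 4,
                    'size': [4, 4, 4, 6],
                    'total_bill': [30.0, 35.0, 40.0, 28.0]})

# 'count' tells us how many meals stand behind each average.
stats = toy.groupby(['sex', 'size'])['total_bill'].agg(['mean', 'count'])

print(stats)
```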

We can pass a list of functions to `agg` if we want to compute several statistics per group. With this syntax, `agg` applies `mean` and `std` to every selected column within each group. The columns `total_bill` and `tip` are selected with a list inside double brackets; the single-bracket form `['total_bill', 'tip']` raises an error in pandas 2.0 and later.

In [None]:
df_bill_tips = df.groupby(['sex'])[['total_bill', 'tip']].agg(['mean', 'std'])

In [None]:
df_bill_tips
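
The resulting table has two-level column labels (column name, statistic). As a sketch on toy data, a single statistic can be selected with a tuple, and the labels can be flattened if plain names are preferred; the joined names such as `total_bill_mean` are our convention:

```python
import pandas as pd

toy = pd.DataFrame({'sex': ['Female', 'Female', 'Male', 'Male'],
                    'total_bill': [10.0, 20.0, 30.0, 50.0],
                    'tip': [1.0, 3.0, 4.0, 6.0]})

stats = toy.groupby('sex')[['total_bill', 'tip']].agg(['mean', 'std'])

mean_bill = stats[('total_bill', 'mean')]   # select one (column, statistic) pair

flat = stats.copy()
flat.columns = ['_'.join(label) for label in flat.columns]   # e.g. 'total_bill_mean'

print(stats.columns.tolist())
print(flat.columns.tolist())
```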

## Exercise 4
<img src="https://www.shareicon.net/data/256x256/2016/06/09/778169_game_512x512.png" alt="Exercise" width="50"/> 

In the tips table, select all rows for which the value of `total_bill` is higher than the mean + 1 standard deviation of all the meals on the same day of the week.

**4.1** How many rows does this dataframe have ?

**Hint:** 

1. Using `groupby` create first a small table whose rows are the days of the week (index is `day`) and whose columns are the mean and std deviation of the total_bill for all meals on that day.

2. Reset the index on that dataframe to turn `day` into an ordinary column.

3. Perform a `merge` of this table with the original table using as key column the `day`, so that each observation in the original table has its corresponding value of avg+std for each day of the week.


In [None]:
# Your answer here

## Exercise 5
<img src="https://www.shareicon.net/data/256x256/2016/06/09/778169_game_512x512.png" alt="Exercise" width="50"/> 

The `income` dataset can be used to predict whether a worker's salary is above or below $50,000. For each person, the dataset contains the following information:
* age 
* education
* marital.status
* relationship
* race
* sex
* hourspeerweek
* nativecountry
* income (whether it is >50K or <=50K)

Load the dataset and remove all rows that have any missing value. You can use the method `dropna`. Also, to make things simpler, replace the values of the `income` column with 0s and 1s (0 if the salary is at most 50K, and 1 otherwise).

In [None]:
dir='data/income.csv'
#dir='/content/Curso-IE/Week_1/data/income.csv'
inc = pd.read_csv(dir)
len(inc)

In [None]:
inc.describe(include='all')

We see that there are some missing values. Let's remove all the rows that contain them.

In [None]:
inc.dropna(inplace=True)
inc.loc[inc.income == '<=50K', "income"] = 0
inc.loc[inc.income == '>50K', "income"] = 1
inc.head(20)

In [None]:
print(len(inc))

Try to come up with some simple rule to predict the variable `income`, and store your prediction in a new variable `income_pred`. You can check how good your prediction is by calling the function `precision`, which takes two vectors and computes the percentage of predictions that are correct (strictly speaking, this quantity is the accuracy).

In [None]:
def precision(income, income_pred):
    ac = sum(income == income_pred)*100/(len(income))
    return "Precision: " + str(ac) + "%"
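
To see what this function measures, here is a minimal sketch on two made-up vectors. Comparing them element-wise gives a Boolean Series, and its mean is the fraction of matches, which is the same quantity that `precision` reports as a percentage:

```python
import pandas as pd

truth = pd.Series([0, 1, 0, 1])        # made-up true labels
prediction = pd.Series([0, 0, 0, 1])   # made-up predictions

matches = truth == prediction          # True where the prediction is right
fraction_correct = matches.mean()      # 3 of 4 predictions match
percent_correct = matches.sum() * 100 / len(truth)

print(fraction_correct, percent_correct)
```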


Let us start with a very simple rule:

1. We predict that only those with a PhD earn more than 50K, and those without it, earn less than 50k

In [None]:
inc["income_pred"] = 0

condition = (inc.education.isin(["Doctorate"])) 
inc.loc[condition, "income_pred"] = 1
precision(inc.income, inc.income_pred)

**5.1** Play around combining logical conditions and try to improve that precision. Can you make it higher than 80%?