# Pandas quick start
Pandas is a third-party, open-source library used for data science. It is perhaps the most important library for you as a student of analytics.

At a high level, Pandas provides the following functionality:
1. Reading and writing data in various formats: csv, sql, feather and many others
2. A set of data structures in which to store data (so higher level than lists, tuples and dictionaries)
3. Functions to transform data in _many_ ways: operating on individual columns or on multiple columns at once, aggregating in total or by category (aka group by), visualizing datasets, etc.
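
As a rough illustration of the three points above, here is a minimal sketch on a made-up table. The column names and values are invented for this example, and an in-memory buffer stands in for a file on disk:

```python
import io

import pandas as pd

# 2. A DataFrame is the main data structure: a table with named columns
toy_df = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston"],  # made-up data
    "year": [2014, 2015, 2014, 2015],
    "sales": [10.0, 12.0, 7.0, 9.0],
})

# 1. Write the table out as csv text, then read it back in
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip_df = pd.read_csv(buffer)

# 3. Transform: derive a new column, then aggregate by category (group by)
round_trip_df["sales_doubled"] = round_trip_df["sales"] * 2
sales_by_city = round_trip_df.groupby("city")["sales"].sum()
print(sales_by_city)
```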

Further, _downstream_ libraries, such as those providing machine learning algorithms (e.g. scikit-learn), know how to consume Pandas data structures.

Extremely helpful Pandas cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf (search the web, there are many more, just as useful)

In [None]:
import numpy as np # <= numpy is only used once, for the np.log function (although pandas is built on top of it)
import pandas as pd # <= `pd` is almost always the abbreviation used for pandas
import seaborn as sn # <= seaborn is not part of pandas, but it is a very useful charting library (built on top of matplotlib)

In [None]:
# remember, since we will be drawing some charts, we need to execute this line - because seaborn uses matplotlib
%matplotlib inline 

# the following line tells pandas to avoid scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)
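
To see what that option changes, the sketch below renders the same made-up Series with and without the format. It uses `pd.option_context` so that each setting applies only inside its `with` block; the values are invented for illustration:

```python
import pandas as pd

big_numbers = pd.Series([123456789.123, 0.000012])  # made-up values

# None means "use pandas' default float formatting", which may switch to scientific notation
with pd.option_context("display.float_format", None):
    default_text = big_numbers.to_string()

# With the option set, every float is rendered with two decimal places
with pd.option_context("display.float_format", "{:.2f}".format):
    fixed_text = big_numbers.to_string()

print(default_text)
print(fixed_text)
```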

## Quick walk-through of Pandas

### Load file and take a quick look at it

Note that this file is available at: https://www.kaggle.com/kumarajarshi/life-expectancy-who/home
Go to that URL and click 'Download', which will start downloading a zip file. `pd.read_csv` can read a zip archive directly as long as it contains a single csv file, so load it as shown below:

In [None]:
# Read csv file
life_df = pd.read_csv("../../datasets/life-expectancy/life-expectancy-who.zip")

In [None]:
life_df.head() # Look at the first 5 lines to visually inspect data

In [None]:
life_df.shape # This file has 2,938 records (rows) and 22 columns

In [None]:
life_df.columns # List of columns

**WARNING** Notice that some column names have a trailing space (for example, `'Life expectancy '` and `'Measles '`)!
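
The rest of this notebook keeps the names exactly as they appear in the file. If you would rather clean them up, one common approach is to strip the whitespace from every column name. This is a sketch on a made-up frame, not something this notebook does:

```python
import pandas as pd

# Toy frame whose column names mimic the trailing-space problem (values are made up)
messy_df = pd.DataFrame({"Life expectancy ": [70.0], "Measles ": [10], "Year": [2015]})

# rename() applies str.strip to every column name and returns a new frame
clean_df = messy_df.rename(columns=str.strip)

print(list(clean_df.columns))
```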

In [None]:
life_df.dtypes

In [None]:
life_df.describe() # quick summary of all the columns

In [None]:
# Warning, this step may take a minute or two to complete
%time sn.pairplot(life_df) # look at all variables at once - pair-plot

The previous plot isn't very useful because there are too many columns. What if we had fewer columns? Let's select just 10 of the columns:

In [None]:
first_few_df = life_df[['Country', 'Year', 'Population', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ']]
first_few_df.head()

Let's also limit the data to the year 2015

In [None]:
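# .copy() gives us an independent data frame, so adding a column to it later does not trigger SettingWithCopyWarning
first_few_2015_df = first_few_df[first_few_df.Year == 2015].copy()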
first_few_2015_df
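
The filter above works in two steps, which the following sketch spells out on a tiny made-up frame. Note that `toy_df["Year"]` and `toy_df.Year` refer to the same column:

```python
import pandas as pd

toy_df = pd.DataFrame({"Country": ["A", "A", "B"], "Year": [2014, 2015, 2015]})  # made-up data

# Step 1: comparing a column to a value gives one True/False per row
is_2015 = toy_df["Year"] == 2015

# Step 2: indexing with that boolean Series keeps only the rows marked True
toy_2015_df = toy_df[is_2015]
print(toy_2015_df)
```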

In [None]:
first_few_2015_df.shape

In [None]:
%time sn.pairplot(first_few_2015_df)

In [None]:
first_few_2015_df['Country'] # Show the column 'Country'

In [None]:
first_few_2015_df['Life expectancy '] # <= Notice the extra space!

In [None]:
first_few_2015_df['Life expectancy '].value_counts() # So what are common life expectancies?

The previous list of numbers is not very useful, so let's plot the distribution:

In [None]:
first_few_2015_df['Life expectancy '].plot.hist()
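
A histogram is essentially a count of values per interval. As an aside, `value_counts` can produce the same kind of summary as a table through its `bins` parameter. The sketch below uses made-up numbers rather than the WHO data:

```python
import pandas as pd

ages = pd.Series([61.2, 63.5, 70.1, 71.4, 74.9, 75.0, 81.3, 82.7])  # made-up values

# With continuous data almost every value is unique, so the plain counts are all 1
raw_counts = ages.value_counts()

# bins=3 groups the values into 3 equal-width intervals before counting
binned_counts = ages.value_counts(bins=3, sort=False)
print(binned_counts)
```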

In [None]:
first_few_2015_df['infant deaths'].plot.hist(bins=10)

In [None]:
np.log(first_few_2015_df['infant deaths']+1).plot.hist(bins=10) # just to "zoom" in quickly - dirty hack (+1 avoids log(0))

In [None]:
#There are some cases where infant deaths are over 200???
first_few_2015_df[first_few_2015_df['infant deaths'] > 200]

The numbers for infant deaths are _so_ high that we need to go back to the data source and double check our understanding.

**Exercise** Check the exact definition of 'infant deaths.'

The 'Measles' value is defined as 'number of reported cases per 1,000 population.' Let's find the actual number of measles cases per country (in 2015):

In [None]:
first_few_2015_df['Measles'] * first_few_2015_df['Population'] # What happened? (hint, extra space)
# Why did you get the error and have you seen that error before? 

In [None]:
# In the calculation below, notice that we just multiplied the two vectors, as if they were numbers...no loops!!
# 'Measles ' is a rate per 1,000 people, so we divide by 1,000 to get a count
first_few_2015_df['Measles '] * first_few_2015_df['Population'] / 1000

Something _very_ interesting happened above. We multiplied two vectors of numbers, element by element, without using a loop! Pandas and numpy (and matrix math) work this way.

Let's add this column back to our data frame:

In [None]:
# We are creating a new column!
first_few_2015_df['Total Measles'] = first_few_2015_df['Measles '] * first_few_2015_df['Population'] / 1000
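
To make the "no loops" point concrete, the sketch below computes an element-by-element product twice: once with an explicit Python loop and once with the pandas operator. The numbers are made up. pandas pairs elements by index label, which here is simply the row position:

```python
import pandas as pd

rate = pd.Series([2.0, 0.5, 1.5])        # made-up values
population = pd.Series([100, 200, 300])  # made-up values

# Plain Python: visit each pair of elements explicitly
looped = [r * p for r, p in zip(rate, population)]

# pandas: the * operator is applied to every pair of elements at once
vectorized = rate * population

print(looped)
print(vectorized.tolist())
```

Both lines print the same three numbers. The second form is shorter and usually much faster on large columns.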

In [None]:
first_few_2015_df

### Cheap version of Pandas

Let's try to make a very tiny, very silly version of Pandas ourselves.

The first implementation is just a dictionary with column names as keys and lists of data as values. Here is an example:

In [1]:
import random

In [8]:
df = {
   "col1": [random.random() for x in range(10)]
 , "col2": [random.random() for x in range(10)]
 , "col3": [random.random() for x in range(10)]
}
df

{'col1': [0.6398486505677585,
  0.3920191311411376,
  0.0764289871014111,
  0.27912566590029975,
  0.6085028811965094,
  0.5844715519424225,
  0.4493329975318324,
  0.255595127621711,
  0.9415402958374092,
  0.10611719951008902],
 'col2': [0.5732160920978465,
  0.7395429181420299,
  0.46768085671348714,
  0.1634372437932784,
  0.6812093521811051,
  0.8066784492113644,
  0.01504381743026273,
  0.7446201983015571,
  0.0028980115185667232,
  0.11589449403020202],
 'col3': [0.918990817420139,
  0.5478576979453401,
  0.9624975236122917,
  0.25060776136028673,
  0.5608692014295706,
  0.7330310730936478,
  0.07407226163415981,
  0.24951115543002644,
  0.9174192241058191,
  0.05865872181721099]}

In [9]:
df['col1']

[0.6398486505677585,
 0.3920191311411376,
 0.0764289871014111,
 0.27912566590029975,
 0.6085028811965094,
 0.5844715519424225,
 0.4493329975318324,
 0.255595127621711,
 0.9415402958374092,
 0.10611719951008902]

#### Write a function to read csv files

In [25]:
import collections

def create_df_from_csv(file):
    
    num_of_columns = None
    header = None
    df = collections.defaultdict(list)
    
    with open(file, "r") as f:
        for line in f:
            tokens = line.split(",") # recall that this is not the best way to parse csv files (python has a csv library built-in)
            if not num_of_columns: 
                num_of_columns = len(tokens) # count the number of columns in the first row
                header = [t.strip() for t in tokens] # assumes the first row will always contain header
            else:
                for idx, col in enumerate(header): df[col].append(tokens[idx].strip()) # assumes all rows have equal number of columns
    return df
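
As the comment in the function notes, Python's built-in `csv` module is a safer way to split a line into fields. Here is a small sketch of the difference, using a made-up two-line file held in memory:

```python
import csv
import io

text = 'Country,Note\n"Korea, Republic of",ok\n'  # made-up sample with a comma inside quotes

# Naive splitting breaks the quoted country name into two tokens
naive_tokens = text.splitlines()[1].split(",")

# csv.reader understands the quotes and keeps the field whole
rows = list(csv.reader(io.StringIO(text)))

print(naive_tokens)
print(rows[1])
```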

For the next step, create a 10-line version of the file in datasets/life-expectancy; otherwise there will be too much data for you to see the structure of the dataframe (the code should work either way).
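
One way to make that file, assuming you have already unzipped the download into a plain csv file, is sketched below. The helper name `write_first_lines` and the file names in the commented usage line are assumptions, not part of the original notebook:

```python
import itertools


def write_first_lines(src_path, dst_path, num_lines=10):
    """Copy the first `num_lines` lines (header included) of src_path to dst_path."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # islice stops after num_lines lines without reading the rest of the file
        dst.writelines(itertools.islice(src, num_lines))


# Hypothetical usage; both file names are assumptions:
# write_first_lines("life-expectancy-who.csv", "life_expectancy_10.csv")
```

With the default of 10 lines you get the header plus nine data rows.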

In [23]:
le_df = create_df_from_csv("../../datasets/life-expectancy/life_expectancy_10.csv")
le_df

defaultdict(list,
            {'Country': ['Afghanistan',
              'Afghanistan',
              'Afghanistan',
              'Afghanistan',
              'Afghanistan',
              'Afghanistan',
              'Afghanistan',
              'Afghanistan',
              'Afghanistan'],
             'Year': ['2015',
              '2014',
              '2013',
              '2012',
              '2011',
              '2010',
              '2009',
              '2008',
              '2007'],
             'Status': ['Developing',
              'Developing',
              'Developing',
              'Developing',
              'Developing',
              'Developing',
              'Developing',
              'Developing',
              'Developing'],
             'Life expectancy': ['65',
              '59.9',
              '59.9',
              '59.5',
              '59.2',
              '58.8',
              '58.6',
              '58.1',
              '57.5'],
             'Adult Mor

In [24]:
le_df['Schooling']

['10.1', '10', '9.9', '9.8', '9.5', '9.2', '8.9', '8.7', '8.4']

**Exercise** The function above is not _production ready._ What are some ways things can go wrong?

**Exercise** How will classes combine the function (verb) and the dataframe data structure (noun)?