## Setup

This guide was written in Python 3.6.

### Python and Pip

Download [Python](https://www.python.org/downloads/) and [Pip](https://pip.pypa.io/en/stable/installing/).

### Other

Let's install the modules we'll need for this tutorial. The `os` and `zipfile` modules ship with Python's standard library, so only the third-party packages need installing. Open your terminal and enter the following commands:

```
pip3 install pandas
pip3 install numpy
```


## Introduction

We've covered Data Acquisition so far, so we know how to <i>get</i> our data. But once you have the data, it might not be in the best shape. You might have scraped a bunch of data from a website but need it as a DataFrame to work with it more easily. This process is called data preparation: getting your data into the format that's easiest to work with.

### Overview

<b> Data Acquisition: </b> Reading and writing with a variety of file formats and databases. <br>
<b> Preparation: </b> Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis. <br>
<b> Transformation: </b> Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables. <br>
<b> Modeling and computation: </b> Connecting your data to statistical models, machine learning algorithms, or other computational tools <br>
<b> Presentation: </b> Creating interactive or static graphical visualizations or textual summaries <br>


## Pandas

Pandas lets us work with data in a way that's easy for humans to understand: with labelled columns and indexes. It lets us import data from files such as CSVs with little effort, quickly apply complex transformations and filters to our data, and much more. Along with NumPy and Matplotlib, it forms a strong base for data exploration and analysis in Python.


In [2]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [3]:
obj = Series([4, 7, -5, 3])
list1 = [4,7,-5,3]
print(Series(list1))

0    4
1    7
2   -5
3    3
dtype: int64


Often it will be desirable to create a Series with an index identifying each data point:


In [4]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)

d    4
b    7
a   -5
c    3
dtype: int64
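
Once a Series has labels, those labels can be used to select values. The following is a minimal sketch that rebuilds the `obj2` Series from above; the names `value_a`, `subset`, and `positive` are only illustrative:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Look up a single value by its label
value_a = obj2['a']

# Select several labels at once; the result is another Series
subset = obj2[['c', 'd']]

# Boolean filtering keeps the index labels attached to the values
positive = obj2[obj2 > 0]

print(value_a)
print(subset)
print(positive)
```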


You can also take a dictionary and convert it to a Series:


In [6]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
print(obj3)
print(type(obj3))

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
<class 'pandas.core.series.Series'>
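
When a dictionary is combined with an explicit `index`, pandas keeps only the keys named in the index and marks any label without a matching key as missing. A small sketch, assuming a hypothetical `states` list in which `'California'` has no entry in `sdata`:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# 'California' is a made-up label with no matching key; 'Utah' is left out on purpose
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# California has no value, so it shows up as NaN; Utah is dropped
print(obj4)
print(obj4.isna())
```

Because `NaN` is a floating-point value, the Series is stored as `float64` rather than `int64`.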


### DataFrames

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [7]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
        'year': [2000, 2001, 2002, 2001, 2002], 
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

Then we take this and convert it to a DataFrame:

In [8]:
frame = DataFrame(data)

This gets us:

In [9]:
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


You can also specify the order of the columns with the `columns` argument:


In [10]:
DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
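
With a DataFrame in hand, the usual next steps are pulling out columns, adding new ones, and filtering rows. Here is a brief sketch using the same `data` dictionary; the column name `after_2000` is an illustrative assumption, not part of the original dataset:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])

# A single column comes back as a Series
years = frame['year']

# Assigning to a new column name adds that column to the frame
frame['after_2000'] = frame['year'] > 2000

# A boolean mask selects only the matching rows
ohio = frame[frame['state'] == 'Ohio']
print(ohio)
```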


### Apply

Let's generate a DataFrame of random values:

In [11]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame)

               b         d         e
Utah    0.286915  0.756919 -1.151341
Ohio   -0.517600  0.127705 -0.273800
Texas  -1.314092  0.307776  1.301489
Oregon -1.855136 -1.452916  0.087474


With this, we can apply a function on a DataFrame:

In [12]:
np.abs(frame)

               b         d         e
Utah    0.286915  0.756919  1.151341
Ohio    0.517600  0.127705  0.273800
Texas   1.314092  0.307776  1.301489
Oregon  1.855136  1.452916  0.087474


We can also apply functions with the `apply()` method:

In [14]:
f = lambda x: x.max() - x.min()

print(frame.apply(f))

b    2.142052
d    2.209835
e    2.452830
dtype: float64


In [16]:
f = lambda x: np.abs(x)

frame = frame.apply(f)

print(frame)

               b         d         e
Utah    0.286915  0.756919  1.151341
Ohio    0.517600  0.127705  0.273800
Texas   1.314092  0.307776  1.301489
Oregon  1.855136  1.452916  0.087474
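
By default `apply()` passes each column to the function. Passing `axis=1` passes each row instead. The sketch below uses a small frame with fixed values (chosen only so the results are easy to check by hand) and a named function `spread` in place of the lambda:

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0],
                      [4.0, 0.5, -1.0]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])

def spread(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()

# Default (axis=0): one result per column
col_spread = frame.apply(spread)

# axis=1: one result per row
row_spread = frame.apply(spread, axis=1)

print(col_spread)
print(row_spread)
```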


#### Sorting

To sort lexicographically by row or column index, use the `sort_index()` method, which returns a new, sorted object:

In [17]:
frame.sort_index()

               b         d         e
Ohio    0.517600  0.127705  0.273800
Oregon  1.855136  1.452916  0.087474
Texas   1.314092  0.307776  1.301489
Utah    0.286915  0.756919  1.151341
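
Sorting is not limited to row labels. As a rough sketch with made-up values, `sort_index(axis=1)` sorts the column labels, and `sort_values()` orders the rows by the contents of a column:

```python
import pandas as pd

frame = pd.DataFrame({'d': [0.7, 0.1, 0.3],
                      'b': [0.2, 0.5, 1.3]},
                     index=['Utah', 'Ohio', 'Texas'])

# Sort rows by their index labels
by_row = frame.sort_index()

# Sort columns by their labels
by_col = frame.sort_index(axis=1)

# Sort rows by the values in column 'b', largest first
by_value = frame.sort_values(by='b', ascending=False)

print(by_row)
print(by_col)
print(by_value)
```

Each call returns a new object and leaves `frame` unchanged.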


Naive Bayes uses Bayes' theorem to predict the class of a given data point. It is fast compared to many other classification algorithms and assumes that the predictors are independent of one another.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it can perform competitively with more sophisticated classification methods on some problems.

### Challenge

Recall Bayes' theorem, which provides a way of calculating the posterior probability. Its formula is as follows:

![Bayes' theorem formula](https://github.com/ByteAcademyCo/stats-programmers/blob/master/bayes.png?raw=true "Bayes' theorem")

Let's go through an example of how the Naive Bayes Algorithm works using `pandas`. We'll go through a classification problem that determines whether a sports team will play or not based on the weather. 

First, let's load the data:


In [None]:
import pandas as pd
f1 = pd.read_csv("./weather.csv")
print(f1)

#### Frequency Table

The first actual step of this process is converting the dataset into a frequency table. Using the `groupby()` function, we get the frequencies:

In [None]:
df = f1.groupby(['Weather','Play']).size()
print(df)

Now let's split the frequencies by weather and yes/no. Let's start with the three weather frequencies:


In [None]:
df2 = f1.groupby('Weather').count()
print(df2)

Now let's get the frequencies of yes and no:

In [None]:
df1 = f1.groupby('Play').count()
print(df1)
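
The same frequencies can also be laid out as a single two-way table with `pd.crosstab()`. Since `weather.csv` is not reproduced here, the sketch below builds a stand-in dataset; the 14 rows and the `Overcast` and `Rainy` labels are assumptions chosen to be consistent with the counts quoted later in this section, not the actual file contents:

```python
import pandas as pd

# Stand-in for weather.csv (hypothetical rows)
weather = pd.DataFrame({
    'Weather': ['Sunny'] * 5 + ['Overcast'] * 4 + ['Rainy'] * 5,
    'Play': (['Yes', 'Yes', 'Yes', 'No', 'No']      # Sunny days
             + ['Yes'] * 4                          # Overcast days
             + ['Yes', 'Yes', 'No', 'No', 'No']),   # Rainy days
})

# Rows are weather conditions, columns are Play outcomes
freq = pd.crosstab(weather['Weather'], weather['Play'])

# margins=True appends row and column totals labelled 'All'
freq_with_totals = pd.crosstab(weather['Weather'], weather['Play'], margins=True)

print(freq)
print(freq_with_totals)
```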

#### Likelihood Table


Next, we create a likelihood table by finding the probabilities of each weather condition and of each yes/no outcome. This requires adding a new column that divides each frequency by the total number of observations.



In [None]:
df1['Likelihood'] = df1['Weather']/len(f1)
df2['Likelihood'] = df2['Play']/len(f1)

This gets us DataFrames that look like:

In [None]:
print(df1)
print(df2)

Now we're able to use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the prediction.


#### Calculation

So now we need a question. Let's propose the following: "Players will play if the weather is sunny. Is this true?"

From this question, we can set up Bayes' theorem. So what's our P(A|B)? It's P(Yes|Sunny), which gives us:

P(Yes|Sunny) = (P(Sunny|Yes)*P(Yes))/P(Sunny)

Based off the likelihood tables we created, we just grab P(Sunny) and P(Yes). 

In [None]:
ps = df2['Likelihood']['Sunny'] # .36
print("Sunny Likelihood: %f" % ps) 
py = df1['Likelihood']['Yes'] # .65
print("Yes Likelihood: %f" %py)

That leaves us with P(Sunny|Yes). This is the probability that the weather is sunny given that the players played that day. In `df`, we see that the number of `yes` days under `sunny` is 3. We divide this number by the total number of `yes` days, which we can get from `df1`.


In [None]:
psy = df['Sunny']['Yes']/df1['Weather']['Yes'] # 3/9
print(psy)

Now, we just have to plug these variables into Bayes' theorem:

In [None]:
p = (psy*py)/ps
print(p)

The result is greater than 0.5, so `Yes` is the more likely class given sunny weather: the answer to our original question is yes!
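
To check the arithmetic end to end, the calculation can be repeated with exact fractions. This is only a sketch: the `records` list is a hypothetical stand-in for `weather.csv`, built to match the counts quoted above (3 sunny `Yes` days out of 9 `Yes` days, with sunny days making up about 36% of the data):

```python
from collections import Counter
from fractions import Fraction

# Hypothetical (weather, play) observations standing in for weather.csv
records = ([("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
           + [("Overcast", "Yes")] * 4
           + [("Rainy", "Yes")] * 2 + [("Rainy", "No")] * 3)

n = len(records)
weather_counts = Counter(weather for weather, _ in records)
play_counts = Counter(play for _, play in records)
joint_counts = Counter(records)

p_sunny = Fraction(weather_counts["Sunny"], n)   # P(Sunny)
p_yes = Fraction(play_counts["Yes"], n)          # P(Yes)
p_sunny_given_yes = Fraction(joint_counts[("Sunny", "Yes")],
                             play_counts["Yes"])  # P(Sunny|Yes)

# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
posterior = p_sunny_given_yes * p_yes / p_sunny

# Sanity check: count the sunny days directly
direct = Fraction(joint_counts[("Sunny", "Yes")], weather_counts["Sunny"])

print(posterior, float(posterior))
print(posterior == direct)
```

With these stand-in counts, the posterior from Bayes' theorem matches the proportion obtained by counting sunny days directly, which is a useful check on the likelihood tables.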

## Extracting Zipfiles

Oftentimes, you'll have to download a large number of files. They might come in the form of zipfiles. Instead of manually unzipping them, you can use Python to extract these files for you, using the `os` and `zipfile` modules. 

### OS Module

The `os` module gives us a portable way of using operating-system-dependent functionality. We'll begin exploring its capabilities now:

In [18]:
import os

With the `os.getcwd()` method, we can get the current directory we're in. This is particularly useful when working with a large number of files in a certain folder. In this case, we'll use `os` to work with the zipfiles:

In [19]:
cwd = os.getcwd()
print(cwd)
dir_path  = os.path.join(cwd, 'Example')
print(dir_path)

/Users/lesleycordero/Desktop/python-data-prep
/Users/lesleycordero/Desktop/python-data-prep/Example


With `os`, you can check to see if a certain directory exists. Here, if it doesn't exist, we create that folder, which we can do with the `os.makedirs()` function:


In [20]:
if not os.path.exists(dir_path):
    os.makedirs(dir_path)
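
The check-then-create pattern can also be written as a single call by passing `exist_ok=True` to `os.makedirs()`. A short sketch, using a temporary directory so that nothing is created in your working folder (the `nested` subfolder is just an illustrative name):

```python
import os
import tempfile

# Work inside a throwaway directory
base = tempfile.mkdtemp()
dir_path = os.path.join(base, 'Example', 'nested')

# Creates 'Example' and 'nested' in one go
os.makedirs(dir_path, exist_ok=True)

# Calling it again does not raise, because the directory already exists
os.makedirs(dir_path, exist_ok=True)

print(os.path.isdir(dir_path))
```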

Lastly, as we do with the `ls` command on the terminal, we can use `os.listdir()` to get the contents of whatever path we provide as an argument.


In [23]:
os.listdir(cwd)

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'Data Preparation.ipynb',
 'dorms.csv',
 'ex.R',
 'Example',
 'example.zip',
 'housing.csv',
 'msleep_ggplot2.csv',
 'names_add.csv',
 'names_extra.csv',
 'names_original.csv',
 'README.md',
 'uk_rain_2014.csv',
 'weather.csv']
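
Since `os.listdir()` returns a plain list of names, it is easy to filter. The sketch below picks out only the CSV files; it creates a few empty files in a temporary folder (the file names mirror some of those listed above) so that it can run anywhere:

```python
import os
import tempfile

# Build a throwaway folder containing a few empty files
folder = tempfile.mkdtemp()
for name in ['weather.csv', 'housing.csv', 'README.md', 'example.zip']:
    open(os.path.join(folder, name), 'w').close()

# Keep only names ending in .csv; sort because listdir order is arbitrary
csv_files = sorted(name for name in os.listdir(folder)
                   if name.endswith('.csv'))

print(csv_files)
```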

From there, we can move onto working with zipfiles!

### ZipFile

The `zipfile` module lets us extract files from a zip archive. First, we import the needed modules and assign the name of the zipfile we'll be extracting to a variable:

In [24]:
import zipfile
import os
zip_name = 'example.zip'

Next, we get the current directory path and join it with the zipfile name to get its exact path:


In [25]:
cwd = os.getcwd()
zip_path = os.path.join(cwd, zip_name)
print(zip_path)

/Users/lesleycordero/Desktop/python-data-prep/example.zip


Next, we use the `ZipFile` class to open the file as a `ZipFile` object. With this object, we call the `extractall()` method to extract all the contents:


In [26]:
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(cwd)
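
It can be useful to look inside an archive before extracting it, or to pull out just one file. The following sketch builds a small archive in a temporary directory (the file names and contents are made up for illustration), then uses `namelist()`, `extract()`, and `extractall()`:

```python
import os
import tempfile
import zipfile

work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'example.zip')

# Create a small archive to experiment with
with zipfile.ZipFile(zip_path, 'w') as z:
    z.writestr('readings/a.txt', 'first file')
    z.writestr('readings/b.txt', 'second file')

with zipfile.ZipFile(zip_path, 'r') as z:
    # List the archive's contents without extracting anything
    names = z.namelist()

    # Extract a single member into work_dir
    z.extract('readings/a.txt', work_dir)

    # Extract everything into a separate folder
    z.extractall(os.path.join(work_dir, 'all'))

print(names)
```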

And lastly, we can list the directory again to see what was extracted (note the new `readings` entry):

In [28]:
os.listdir(cwd)

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'Data Preparation.ipynb',
 'dorms.csv',
 'ex.R',
 'Example',
 'example.zip',
 'housing.csv',
 'msleep_ggplot2.csv',
 'names_add.csv',
 'names_extra.csv',
 'names_original.csv',
 'readings',
 'README.md',
 'uk_rain_2014.csv',
 'weather.csv']

## Data Merging

If you encounter two different datasets that contain the same type of information, you might consider merging them for your analyses. This is yet another functionality built into `pandas`. 

Let's go through an example containing student data. `d1` contains 5 of the samples and `d2` contains 2 of them: 

In [33]:
d1 = pd.read_csv("./names_original.csv")
print(d1)
print(type(d1))

  First Name  Last Name
0     Lesley    Cordero
1       Ojas      Sathe
2      Helen       Chen
3        Eli   Epperson
4      Jacob  Greenberg
<class 'pandas.core.frame.DataFrame'>


In [34]:
d2 = pd.read_csv("./names_add.csv")
print(d2)

  First Name Last Name
0     Martin     Perez
1      Menna   Elsayed


### Concatenation 

Instead of working with two separate datasets, it's much easier to combine them, which we do with the `concat()` function:


In [35]:
result = pd.concat([d1,d2])
print(result)

  First Name  Last Name
0     Lesley    Cordero
1       Ojas      Sathe
2      Helen       Chen
3        Eli   Epperson
4      Jacob  Greenberg
0     Martin      Perez
1      Menna    Elsayed
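
Notice that the row labels `0` and `1` repeat, because each dataset keeps its own index. If the original labels don't matter, `ignore_index=True` renumbers the rows. A minimal sketch with shortened versions of the two datasets, built inline rather than read from the CSV files:

```python
import pandas as pd

d1 = pd.DataFrame({'First Name': ['Lesley', 'Ojas'],
                   'Last Name': ['Cordero', 'Sathe']})
d2 = pd.DataFrame({'First Name': ['Martin'],
                   'Last Name': ['Perez']})

# Default: each frame keeps its own labels, so 0 appears twice
stacked = pd.concat([d1, d2])

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([d1, d2], ignore_index=True)

print(stacked)
print(renumbered)
```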


Now, you might be asking what will happen if one of the datasets has more columns than the other - will they still be allowed to merge? Let's try this example with another dataset:

In [37]:
d3 = pd.read_csv("./names_extra.csv")
print(d3)

  First Name Last Name                   Major
0     Martin     Perez  Mechanical Engineering
1      Menna   Elsayed               Sociology


If we use the same `concat()` function, we get:

In [40]:
result1 = pd.concat([d1, d3])
print(result1)

  First Name  Last Name                   Major
0     Lesley    Cordero                     NaN
1       Ojas      Sathe                     NaN
2      Helen       Chen                     NaN
3        Eli   Epperson                     NaN
4      Jacob  Greenberg                     NaN
0     Martin      Perez  Mechanical Engineering
1      Menna    Elsayed               Sociology




Notice the `NaN` values: these mark missing data. Since `d1` has no `Major` column, `pandas` fills in `NaN` for each of its rows. Also note that `concat()` kept each dataset's original index, so the label `0` now appears twice, and looking it up returns both rows:

In [41]:
print(result1['Major'][0])

0                       NaN
0    Mechanical Engineering
Name: Major, dtype: object
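
Once missing values are in the data, they usually need to be found and then either filled or dropped. A brief sketch on a cut-down version of the combined data; the replacement text `'Undeclared'` is a hypothetical choice, not something from the original dataset:

```python
import pandas as pd

result1 = pd.DataFrame({
    'First Name': ['Lesley', 'Ojas', 'Menna'],
    'Major': [float('nan'), float('nan'), 'Sociology'],
})

# True wherever a value is missing
missing = result1['Major'].isna()

# Replace missing values with a placeholder
filled = result1['Major'].fillna('Undeclared')

# Or keep only the rows that have a Major
with_major = result1.dropna(subset=['Major'])

print(missing)
print(filled)
print(with_major)
```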


### Merging

Now, how do we merge two datasets with differing columns? Well, let's take a look at our datasets:

In [42]:
h1 = pd.read_csv("./housing.csv")
print(h1)

          Dorm            Name
0  East Campus      Helen Chen
1     Broadway   Danielle Jing
2      Shapiro    Craig Rhodes
3         Watt  Lesley Cordero
4  East Campus    Martin Perez
5     Broadway   Menna Elsayed
6      Wallach   Will Essilfie


In [43]:
h2 = pd.read_csv("./dorms.csv")
print(h2)

          Dorm Street    Cost
0     Broadway  114th    9000
1      Shapiro  115th    9500
2         Watt  113th   10500
3  East Campus  116th  11,000
4      Wallach  114th    9500
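
One detail worth noticing in this table is that `11,000` contains a thousands separator, which suggests the `Cost` column was read as text rather than as numbers. Assuming that is the case, a sketch of one way to clean it is to strip the commas and convert:

```python
import pandas as pd

# The Cost values as they appear above, treated as strings (an assumption)
cost = pd.Series(['9000', '9500', '10500', '11,000', '9500'])

# Remove the separator, then convert the text to numbers
cost_clean = pd.to_numeric(cost.str.replace(',', '', regex=False))

print(cost_clean)
```

Another option may be to pass `thousands=','` to `pd.read_csv()` so the values are parsed as numbers when the file is loaded.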


With the `merge()` function in pandas, we can specify which column to merge on and what kind of join to perform. By default `merge()` does an 'inner' join, but here we set it to a left join:

In [44]:
house = pd.merge(h1, h2, on="Dorm", how="left")
print(house)

          Dorm            Name Street    Cost
0  East Campus      Helen Chen  116th  11,000
1     Broadway   Danielle Jing  114th    9000
2      Shapiro    Craig Rhodes  115th    9500
3         Watt  Lesley Cordero  113th   10500
4  East Campus    Martin Perez  116th  11,000
5     Broadway   Menna Elsayed  114th    9000
6      Wallach   Will Essilfie  114th    9500
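
The difference between a left join and the default inner join only shows up when a key is missing from one side. In the sketch below, the dorm `'Carman'` and the student names are hypothetical, added so that one row has no match in the second table:

```python
import pandas as pd

h1 = pd.DataFrame({'Dorm': ['Watt', 'Broadway', 'Carman'],
                   'Name': ['Student A', 'Student B', 'Student C']})
h2 = pd.DataFrame({'Dorm': ['Watt', 'Broadway'],
                   'Street': ['113th', '114th']})

# Left join: keep every row of h1, filling unmatched columns with NaN
# indicator=True adds a '_merge' column saying where each row came from
left = pd.merge(h1, h2, on='Dorm', how='left', indicator=True)

# Inner join: keep only the dorms present in both tables
inner = pd.merge(h1, h2, on='Dorm', how='inner')

print(left)
print(inner)
```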


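By contrast, `concat()` does not match rows on `Dorm`; it simply stacks the two datasets and fills the columns a dataset lacks with `NaN`:

In [45]: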
pd.concat([h1,h2])



Unnamed: 0,Cost,Dorm,Name,Street
0,,East Campus,Helen Chen,
1,,Broadway,Danielle Jing,
2,,Shapiro,Craig Rhodes,
3,,Watt,Lesley Cordero,
4,,East Campus,Martin Perez,
5,,Broadway,Menna Elsayed,
6,,Wallach,Will Essilfie,
0,9000.0,Broadway,,114th
1,9500.0,Shapiro,,115th
2,10500.0,Watt,,113th
