Every week we'll cover a new data science topic in Python to help prepare your team for the competition!

This week will cover dataframe manipulation and we'll be working with the Toronto *Rain Gauge Locations and Precipitation* dataset.

This is a large collection of rainfall measurements taken over the past 3 years (link to the complete dataset: https://goo.gl/gBYrb4). I have already downloaded the data for you and archived it in the *2017_rainfall_data* folder. For the sake of simplicity, we'll just be looking at data from 2017.

Let's begin by importing some libraries.

A few ground rules:

 - Remember to run every cell
     - Parts of this workshop won't work if this condition isn't met
 - Please don't change my asserts
     - If you're receiving an incorrect answer, please don't change the assert answer just to get it right. You won't learn anything and will probably fail the rest of the tutorial. Please come and ask me for help if you get stuck.

In [55]:
#don't mind this. I'm just trying to double check your work :)
def assertAns(condition, fail_str, suc_str):
    assert condition, fail_str
    print(suc_str)

In [3]:
import numpy as np #Linear algebra
import requests as req #a third-party HTTP library
import re #Python's Regex library
import pandas as pd #Python's data manipulation library
import pickle

I'm going to introduce a new type of data structure: the dataframe.

Dataframes are basically tables. Each dataframe is made up of columns, rows, and a header. Yep, sounds like a table. Dataframes are used to store data in an organized fashion. A dataframe resembles a table because a table is a neat format for holding information: all cells underneath a header belong to the same "attribute". This standardization is what lets tables store information so neatly.

Let's go ahead and make our first dataframe!

In [2]:
#run this cell to make your first dataframe!

a2DArray = [["apple","potato"],
            ["banana","onion"]]


myFirstDataframe = pd.DataFrame(a2DArray,columns=["fruits","vegitables"])
myFirstDataframe

Unnamed: 0,fruits,vegitables
0,apple,potato
1,banana,onion


A few things to note:
 - The *columns* argument of the *pd.DataFrame()* command specifies the headers for the dataframe
 - The numbers on the left-hand side of the dataframe are the row indexes. Note how they were automatically generated.
 - Jupyter automatically displays the dataframe if the dataframe is the last expression in the cell
 - We pass in a 2D array to make the dataframe because dataframes (and, by extension, tables) are basically 2D arrays!
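
As a small aside to the notes above, here is a minimal sketch of another common way to build a dataframe: from a dictionary, where each key becomes a header. The *produce* name and its contents are made up for illustration and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column header
produce = pd.DataFrame({
    "fruits": ["apple", "banana"],
    "vegetables": ["potato", "onion"],
})

print(list(produce.columns))  # the headers
print(list(produce.index))    # the automatically generated row indexes
print(produce.shape)          # (number of rows, number of columns)
```

Both approaches give the same kind of object; the dictionary form is just handy when your data is already organized by column.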

Now that we have a better understanding of the structure of a dataframe, let's import our dataset into a dataframe!

CSV files store tabular data as plain text, one row per line with the values separated by commas (which is why Excel can open them as a table). That makes the data in a CSV file a perfect fit for a dataframe.

In fact, transforming data from a CSV file into a dataframe is so easy that Pandas has a built-in function that does it in one step!

In [4]:
#note how I am navigating through our file system to find the csv file I want.
#the path is always relative to the directory of the current notebook
rainfallDF = pd.read_csv("../2017_rainfall_data/rainfall201706.csv")

Datasets don't only come in CSV files! They can come in a JSON format or even as a shapefile. Don't hesitate to ask how to import these files on Slack!
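
To give a rough idea of what that looks like, here is a small sketch that reads the same made-up table from CSV text and from JSON text. The station names and values are invented, and *io.StringIO* is only used so the example runs without any files on disk.

```python
import io
import pandas as pd

# Hypothetical data, written out as CSV text and as JSON text
csv_text = "id,name,rainfall\n1,RG_A,0.0\n2,RG_B,1.5\n"
json_text = (
    '[{"id": 1, "name": "RG_A", "rainfall": 0.0},'
    ' {"id": 2, "name": "RG_B", "rainfall": 1.5}]'
)

# io.StringIO lets pandas treat a string as if it were a file
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

With real files you would pass the file path instead of the *StringIO* object, just like in the *read_csv* cell above.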

Let's take a look at the first five rows of our dataframe using the _head(n)_ command. By default _n=5_, so _dataframe.head()_ returns the first 5 rows of our dataframe. You'll mainly be using this command to glance at your dataframe. It's pretty useful.

Try that with our *rainfallDF*

In [1]:
rainfallDF.head()

NameError: name 'rainfallDF' is not defined

Sometimes we want to select a certain row of a dataframe by its position. To do so, we simply put _iloc[n]_ after our dataframe:

_dataframe_.iloc[n]

Much like arrays, we replace _n_ with the index that we want to retrieve.
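
Here is a minimal sketch of *iloc* on a toy dataframe (the *toy* name and its values are made up, so it won't give away the exercise below).

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "name": ["RG_A", "RG_B", "RG_C"],
    "rainfall": [0.0, 1.5, 0.2],
})

first_row = toy.iloc[0]    # a single row comes back as a Series
last_row = toy.iloc[-1]    # negative positions count from the end, like lists
first_two = toy.iloc[0:2]  # a slice comes back as a smaller dataframe

print(first_row["name"])
print(last_row["rainfall"])
print(len(first_two))
```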

Try to grab the first row of our dataframe using the *iloc[]* command.

In [6]:
rainfallRow = rainfallDF.<FILL IN>

SyntaxError: invalid syntax (<ipython-input-6-9c9b64800a1b>, line 1)

Now let's select a certain column within our dataframe!
We do so like this:

newVarToHoldColumnVals = DF["theHeaderForTheColumn"]
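
As a quick sketch on made-up data: selecting with one header gives back a single column (a Series), while passing a list of headers gives back a smaller dataframe. The *toy* dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "name": ["RG_A", "RG_B"],
    "rainfall": [0.0, 1.5],
    "id": [1, 2],
})

rainfall_values = toy["rainfall"]    # one header -> a Series
name_and_id = toy[["name", "id"]]    # a list of headers -> a dataframe

print(type(rainfall_values).__name__)
print(list(name_and_id.columns))
```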

Try to make a variable to hold all the dates of our dataframe.

In [None]:
datesColumn = <FILL IN>

Whenever we're working with a dataset we usually go through 4 stages:

1. Glancing at the data
2. Normalization and cleaning
3. Gathering insights (feature engineering)
4. Generating the models

###### Glancing at the data
When we glance at the data we first want to look at the beast and see what's coming. The point of glancing at the data is to gather context on what we're looking at so we can begin to ask ourselves questions to answer!

Some things I usually look at when I'm glancing at data are the means of the columns, the unique values of categorical variables, and counts. An abnormally high number in a column should spark some questions.

###### Normalization and cleaning
Once we understand our data, we'll have to start cleaning it for modeling. This involves clearing rows with null values (or imputing them with averages), removing useless columns (or columns that are too similar to others), or converting all numbers to a universal unit for your dataset.

###### Gathering insights
With our data cleaned, we can move on to extracting meaningful information out of our existing variables. This is called feature engineering and is often the main contributor to a model's success. An example of feature engineering is extracting the surname out of each person's name so that rows belonging to the same family can be related. Another example would be counting the number of fruits in each meal presented. The goal of feature engineering is to provide more variables for our algorithms to play with. By manually identifying these patterns, we save the computer the additional effort of discovering them algorithmically.
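
The surname example can be sketched roughly like this. The names are invented, and taking the last word of the name as the surname is a simplifying assumption that won't hold for every real dataset.

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame({
    "full_name": ["Ann Smith", "Bob Smith", "Cara Jones"],
})

# New feature 1: the surname (assumed to be the last word of the name)
people["surname"] = people["full_name"].str.split().str[-1]

# New feature 2: how many rows share that surname
people["family_size"] = people.groupby("surname")["surname"].transform("count")

print(people)
```

Neither new column adds information that wasn't already in *full_name*; they just make the family relationship explicit so a model doesn't have to discover it.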

From experience, steps 1-3 should take up about 90% of your time. It also just so happens that the first three steps occur organically as your questions arise. Once you start looking at the data, you'll want to answer some questions... which will lead you to clean the data to gather insights from it. Then you may have more questions, so you'll repeat the cycle.

###### Generating the models
Once our dataset has been prepared, we can finally have some fun and do some machine learning to make predictions! Depending on the type of dataset, you may want to implement a regression algorithm to predict a numeric value (like predicting the speed of a car given these variables) or a classification algorithm to sort observations into different categories (like passing or failing a test).

So let's glance at the data. One useful way to learn about the specifics of your dataset is to run the *describe()* function on it. Let's try that on our *rainfallDF*!

In [5]:
rainfallDF.describe()

Unnamed: 0,id,rainfall
count,359051.0,347427.0
mean,7715.657639,0.012369
std,81.252303,0.145904
min,7674.0,0.0
25%,7684.0,0.0
50%,7696.0,0.0
75%,7708.0,0.0
max,8049.0,9.14


That's interesting. Notice how the count of *rainfall* values doesn't match the count of *id*s. Since *describe()* only counts non-missing values and every id should have a corresponding *rainfall* value, I suspect that some data is missing from the rainfall column. Let's run the *dropna()* function to remove all rows in the dataframe that have a missing ("NaN") value in any column.
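
To see why the counts can differ, here is a small sketch on made-up data. It also shows *fillna()*, the imputing alternative mentioned in the cleaning stage; whether dropping or imputing is the better choice depends on your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four rainfall readings are missing
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rainfall": [0.0, np.nan, 1.5, np.nan],
})

# count() skips missing values, which is why the two columns disagree
print(toy["id"].count(), toy["rainfall"].count())

# Option 1: drop every row that has a missing value
dropped = toy.dropna()

# Option 2: fill the missing values with the column average instead
filled = toy.fillna({"rainfall": toy["rainfall"].mean()})

print(len(dropped))
print(filled["rainfall"].tolist())
```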

In [8]:
droppedNa = rainfallDF.<FILL IN>

Now let's re-run the *describe()* function on our new dataframe to see if the counts of *id* and *rainfall* are the same.

In [9]:
droppedNa.<FILL IN>

Unnamed: 0,id,rainfall
count,347427.0,347427.0
mean,7716.586077,0.012369
std,82.415253,0.145904
min,7674.0,0.0
25%,7685.0,0.0
50%,7697.0,0.0
75%,7709.0,0.0
max,8049.0,9.14


Perfect. 

I'm a bit curious... What is the average amount of rainfall per measurement? Assign that number to the *averageRainfall* variable.

In [21]:
averageRainfall = <FILL IN>

SyntaxError: invalid syntax (<ipython-input-21-f9ab99d46445>, line 1)

In [58]:
assertAns(round(averageRainfall, 6) == 0.012369, "That is not the average rainfall!","Test passed") #rounded because the exact mean has more decimal places

Test passed


A bit more about the methodology we just went through. Notice how we asked ourselves some questions after glancing at the data. Next, we had to clean our dataset to answer our question. Finally, we asked ourselves more questions about the data. We keep jumping back and forth between the four stages! Data analysis is a truly exploratory task, which makes it hard to estimate how long it'll take. So start early on your projects!

Let's practice selecting columns again. Try to select the *rainfall* column below and save it into the *rainfallColumn* variable.

In [26]:
rainfallColumn = rainfallDF[<FILL IN>]

Now, using this column, let's run the *mean()* function on it. This will give us the average amount of rainfall that fell. We're going to save this value into the *avgRainfall* variable.

In [None]:
avgRainfall = rainfallColumn.<FILL IN>

In [None]:
assertAns(round(avgRainfall, 6) == 0.012369, "That is not the average rainfall!", "Test passed")

Let's also run the *sum()* function on the *rainfallColumn* to determine the total rainfall recorded across all gauges in Toronto in June 2017. We'll save this value into the *totalRainfall* variable.

In [27]:
totalRainfall = rainfallColumn.<FILL IN>

In [28]:
assertAns(abs(totalRainfall - 4297.263999) < 1e-6, "That is not the total rainfall!","Test passed")

4297.2639990000007

Notice how the average we computed with *mean()* is the same value that we got by running the *describe()* function earlier.

I'm going to showcase one last function: *unique()*. This function returns all unique values in a column, which makes it really useful for categorical columns. More information about categorical variables here: http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm.

Let's run *unique()* on the "name" column of the *rainfallDF*. We'll save this value to the *uniqueNames* variable.

In [None]:
uniqueNames = rainfallDF[<FILL IN>].<FILL IN>

In [61]:
assertAns(list(uniqueNames) == ['RG_001', 'RG_002', 'RG_003', 'RG_004', 'RG_006', 'RG_007','RG_012', 'RG_013', 'RG_014', 'RG_015', 'RG_016', 'RG_017','RG_018', 'RG_019', 'RG_020', 'RG_021', 'RG_022', 'RG_023','RG_024', 'RG_025', 'RG_027', 'RG_028', 'RG_030', 'RG_031','RG_033', 'RG_034', 'RG_035', 'RG_036', 'RG_037', 'RG_038','RG_040', 'RG_041', 'RG_042', 'RG_044', 'RG_045', 'RG_046','RG_047', 'RG_048', 'RG_049', 'RG_051', 'RG_052', 'RG_054','RG_055', 'RG_056'], "Those aren't the unique station names!","Test passed")

Test passed


These small functions (*sum()*, *mean()*, *unique()*) and others like them (*mode()*, *median()*, etc.) are great for glancing at your dataset. They let you see how the data varies within individual columns.

Before we end this workshop, I would like to draw attention to one of the more important aspects of data analysis... learning how to generate more data based on existing data. In other words, I'll give you the tools to work on the third stage. This time, I'll lead by example:
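
Here is a compact sketch of these functions side by side on a made-up dataframe, so you can compare what each one returns. The station names and values are invented.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "name": ["RG_A", "RG_A", "RG_B", "RG_B"],
    "rainfall": [0.0, 0.5, 0.5, 2.0],
})

rain = toy["rainfall"]
total = rain.sum()         # adds up every value
average = rain.mean()      # total divided by the number of values
middle = rain.median()     # the middle value once sorted
most_common = rain.mode()  # a Series, since several values can tie
stations = toy["name"].unique()  # each distinct value once

print(total, average, middle)
print(most_common.tolist())
print(list(stations))
```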

In [7]:
rainfallDF["date"].unique()

array(['2017-06-01T00:00:00', '2017-06-01T00:05:00', '2017-06-01T00:10:00',
       ..., '2017-06-29T23:50:00', '2017-06-29T23:55:00',
       '2017-06-30T00:00:00'], dtype=object)

In [11]:
#I'm curious and want to know the average rainfall for every hour of the day
#I'll begin by looping through each row in the dataframe, taking the time within the "date" attribute and putting it
#into another column called "time"
transformedDF = rainfallDF #note: this is a second name for the same dataframe, not a copy (rainfallDF.copy() makes a real copy)
for index, row in transformedDF.iterrows():
    dateForThisRow = row["date"]
    theTime = dateForThisRow.split("T")[1] #values in the "date" column look like this: 2017-06-01T00:00:00
    #I am splitting the string on the "T" and selecting the "1" index because that is the time when the value
    #was taken
    transformedDF.at[index,'time'] = theTime #use this to set the values of a new column (set_value was removed in pandas 1.0)
#let's have a quick look at the transformed dataframe:
transformedDF.head()

Unnamed: 0,id,name,date,rainfall,time
0,7677,RG_001,2017-06-01T00:00:00,0.0,00:00:00
1,7677,RG_001,2017-06-01T00:05:00,0.0,00:05:00
2,7677,RG_001,2017-06-01T00:10:00,0.0,00:10:00
3,7677,RG_001,2017-06-01T00:15:00,0.0,00:15:00
4,7677,RG_001,2017-06-01T00:20:00,0.0,00:20:00
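
The row-by-row loop above is easy to follow, but it gets slow on a few hundred thousand rows. As a sketch of an alternative, pandas can apply the same string operations to a whole column at once. The *toy* dataframe and the *hour_as_int* column are made up for illustration.

```python
import pandas as pd

# Hypothetical data in the same format as the "date" column
toy = pd.DataFrame({
    "date": ["2017-06-01T00:00:00", "2017-06-01T13:05:00"],
})

# Same idea as the loop: split on "T" and keep the part at index 1
toy["time"] = toy["date"].str.split("T").str[1]

# The hour is the part of the time before the first ":"
toy["hour"] = toy["time"].str.split(":").str[0]

# Alternatively, parse the strings into real timestamps and ask for the hour
toy["hour_as_int"] = pd.to_datetime(toy["date"]).dt.hour

print(toy)
```

Note that the *hour* column holds strings such as "00", while *hour_as_int* holds integers such as 0, so comparisons need to use the matching type.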


In [13]:
transformedDF["time"].unique() #wow. it seems like they take measurements every 5 minutes.

array(['00:00:00', '00:05:00', '00:10:00', '00:15:00', '00:20:00',
       '00:25:00', '00:30:00', '00:35:00', '00:40:00', '00:45:00',
       '00:50:00', '00:55:00', '01:00:00', '01:05:00', '01:10:00',
       '01:15:00', '01:20:00', '01:25:00', '01:30:00', '01:35:00',
       '01:40:00', '01:45:00', '01:50:00', '01:55:00', '02:00:00',
       '02:05:00', '02:10:00', '02:15:00', '02:20:00', '02:25:00',
       '02:30:00', '02:35:00', '02:40:00', '02:45:00', '02:50:00',
       '02:55:00', '03:00:00', '03:05:00', '03:10:00', '03:15:00',
       '03:20:00', '03:25:00', '03:30:00', '03:35:00', '03:40:00',
       '03:45:00', '03:50:00', '03:55:00', '04:00:00', '04:05:00',
       '04:10:00', '04:15:00', '04:20:00', '04:25:00', '04:30:00',
       '04:35:00', '04:40:00', '04:45:00', '04:50:00', '04:55:00',
       '05:00:00', '05:05:00', '05:10:00', '05:15:00', '05:20:00',
       '05:25:00', '05:30:00', '05:35:00', '05:40:00', '05:45:00',
       '05:50:00', '05:55:00', '06:00:00', '06:05:00', '06:10:

In [18]:
#with this new dataframe I want to create an array that describes the average amount of rainfall for every hour
#I'll start off by selecting just the first minute of the first hour of the day
firstMinuteDF = transformedDF[transformedDF["time"] == "00:00:00"]
print(len(firstMinuteDF))
firstMinuteDF.head()

1289

In [23]:
#Let's manipulate the DF again and add a new "hour" column
transformedDF2 = transformedDF #again, this is another name for the same dataframe, not a copy
for index, row in transformedDF2.iterrows():
    timeForThisRow = row["time"]
    theHour = timeForThisRow.split(":")[0] #values in the "time" column look like this: 00:00:00
    #I am splitting the string on the ":" and selecting the "0" index because that is the hour when the value
    #was taken
    transformedDF2.at[index,'hour'] = theHour #use this to set the values of a new column (set_value was removed in pandas 1.0)
#let's have a quick look at the transformed dataframe:
transformedDF2.head()

Unnamed: 0,id,name,date,rainfall,time,hour
0,7677,RG_001,2017-06-01T00:00:00,0.0,00:00:00,0
1,7677,RG_001,2017-06-01T00:05:00,0.0,00:05:00,0
2,7677,RG_001,2017-06-01T00:10:00,0.0,00:10:00,0
3,7677,RG_001,2017-06-01T00:15:00,0.0,00:15:00,0
4,7677,RG_001,2017-06-01T00:20:00,0.0,00:20:00,0


I'm also going to teach you some new selecting commands. Watch how I select all rows that were measured during the first hour:

In [25]:
firstHourDF = transformedDF2[transformedDF2["hour"] == "00"] #I'm saying: select all rows that have an "hour" value
#of "00"
print(firstHourDF["rainfall"].mean())
#the following value is the average amount of rainfall for all measurements that were taken in the first hour

0.011778084270050341

In [31]:
#now let's try the same measurement for the hours of the day
arrToHoldHours = []
for hour in range(23): #note: range(23) stops at hour 22; use range(24) to include hour 23 as well
    strHour = str(hour)
    #here I am formating the "hour" so it matches with the style of the hour values in the DF
    if len(strHour) == 1:
        strHour = "0" + strHour
    #lets actually do the selecting now
    selectedHour = transformedDF2[transformedDF2["hour"] == strHour]
    avgRainfallForThatHour = selectedHour["rainfall"].mean() #calculating the mean
    arrToHoldHours.append("hour " + strHour + ": " + str(avgRainfallForThatHour))
    
#andddd here we are printing the data
for oneHour in arrToHoldHours:
    print(oneHour)

hour 00: 0.0117780842701
hour 01: 0.0149699861687
hour 02: 0.0421227027774
hour 03: 0.0203372478585
hour 04: 0.00289961336647
hour 05: 0.00738124827396
hour 06: 0.0254926815797
hour 07: 0.0270355785838
hour 08: 0.00957183332182
hour 09: 0.00285955328124
hour 10: 0.00885984665331
hour 11: 0.00914015334669
hour 12: 0.00492495854063
hour 13: 0.0194678176105
hour 14: 0.0415913889464
hour 15: 0.00759516908213
hour 16: 0.0015831433506
hour 17: 0.00109329466197
hour 18: 0.00149875553097
hour 19: 0.00667431890472
hour 20: 0.0136510261903
hour 21: 0.0059092039801
hour 22: 0.00209551454834


We can see that the most precipitation happens at 2 AM. This could be a fluke or an actual pattern; a single month of data isn't enough to tell. Since the measurements were carried out every 5 minutes, each hour contributes roughly the same number of readings, so the hours can be compared fairly. It might be interesting to see how measurements from different seasons differ, and even from different years.

Before I wrap up the workshop, I'll teach you one last selection method: how to impose multiple rules when selecting rows. You may have noticed how I used *transformedDF[transformedDF["time"] == "00:00:00"]* to select all measurements that were taken in the first minute of the day. But what if I want to select values that came from the first minute AND from a specific station, say *RG_001*? For that I need the multi-condition selection syntax. Here's what it looks like:
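
The per-hour loop above can also be written with *groupby()*, which splits the rows by a column's values and summarizes each group in one go. This is a sketch on made-up numbers, not the workshop data.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "hour": ["00", "00", "01", "01", "02"],
    "rainfall": [0.0, 0.2, 0.1, 0.3, 1.0],
})

# One average per distinct value in the "hour" column
avg_by_hour = toy.groupby("hour")["rainfall"].mean()

# idxmax() returns the label (here, the hour) with the largest average
wettest_hour = avg_by_hour.idxmax()

print(avg_by_hour)
print(wettest_hour)
```

Because *groupby()* works from whatever values are present in the column, there is no need to build the list of hours by hand.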

In [37]:
firstMinOfDayAndRG_001 = transformedDF[(transformedDF["time"] == "00:00:00") & (transformedDF["name"] == "RG_001")]
#notice the parentheses around each condition
print("Amount selected: " + str(len(firstMinOfDayAndRG_001)))
firstMinOfDayAndRG_001.head()

Amount selected: 30


Unnamed: 0,id,name,date,rainfall,time,hour
0,7677,RG_001,2017-06-01T00:00:00,0.0,00:00:00,0
288,7677,RG_001,2017-06-02T00:00:00,0.0,00:00:00,0
576,7677,RG_001,2017-06-03T00:00:00,0.0,00:00:00,0
864,7677,RG_001,2017-06-04T00:00:00,0.0,00:00:00,0
1152,7677,RG_001,2017-06-05T00:00:00,0.0,00:00:00,0
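
Besides *&* (AND), conditions can be combined with *|* (OR) and flipped with *~* (NOT). Here is a short sketch on made-up data; as before, each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "name": ["RG_A", "RG_A", "RG_B", "RG_B"],
    "time": ["00:00:00", "00:05:00", "00:00:00", "00:05:00"],
    "rainfall": [0.0, 0.4, 0.0, 0.0],
})

# AND: both conditions must hold
both = toy[(toy["time"] == "00:00:00") & (toy["name"] == "RG_A")]

# OR: at least one condition must hold
either = toy[(toy["name"] == "RG_A") | (toy["rainfall"] > 0)]

# NOT: keep the rows where the condition is false
not_rg_a = toy[~(toy["name"] == "RG_A")]

print(len(both), len(either), len(not_rg_a))
```

Python's own *and*, *or*, and *not* keywords don't work here, because they can't compare whole columns element by element.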


Sometimes we want to save our work by saving the dataframe we made (so we don't have to process everything all over again every time we close Jupyter). Pandas can use Python's _pickle_ library to save a dataframe as a _.pkl_ file. To save our dataframe, run this command:

In [12]:
transformedDF.to_pickle("../saved_dataframes/workshop1RainfallDF") #note we'll save our pickle file into the 
#saved_dataframes folder. Keeping all your data in one place will keep you organised.

TypeError: to_pickle() missing 1 required positional argument: 'path'
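
To load a saved dataframe back in a later session, pandas provides *read_pickle()*. Here is a minimal round-trip sketch; the *toy* dataframe and the *toy_rainfall.pkl* file name are made up, and a temporary folder is used so the example doesn't touch your project files.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "name": ["RG_A", "RG_B"],
    "rainfall": [0.0, 1.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_rainfall.pkl")
    toy.to_pickle(path)              # save the dataframe to disk
    restored = pd.read_pickle(path)  # load it back

# equals() checks that the values and column types survived the round trip
print(restored.equals(toy))
```

Only load pickle files that you created yourself or that come from someone you trust, since unpickling can run arbitrary code.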

That's it! Congrats on making it this far. I hope this tutorial was helpful and that you learned something. Please don't hesitate to ask for help, and any feedback you provide will be taken into consideration for future workshops. Good luck on your projects!