# 7-1: Python Pandas Tutorial 1

Pandas is a Python module that makes data science easy and effective.

For example, say you have a data set for weather in New York for the month of January. You may have questions such as: 1) What was the maximum temperature in NY in January? 2) On which days did it rain? 3) What was the average wind speed during the month?

Excel is great until you have a really large data set (think millions of rows). Then Excel becomes slow, and analysis gets difficult.

The next way to analyze big sets of data is plain Python. You can write a program to get those answers, but you have to spend a lot of time writing the code. It's also not very convenient (you have to test it, it might have bugs, etc.). If you want to do other analytics, you have to write even more code.

With the pandas library, we can do the same thing with only a few lines of code.

Some terms in the video:

**df** = dataframe object (the core of pandas)...if you print it, it shows as a data table

**EST** = the date column

**data munging** or **data wrangling** = the process of cleaning messy data...you have to make your data ready for your next step (processing the data). Pandas makes data munging easy. For example, there is a method called **fillna()**. If you do **fillna(0)**, it fills all empty data cells with 0.
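As a rough illustration of **fillna(0)**, here is a minimal sketch. The three-row sample and its missing values are made up for this example (they are not from the video's data set).

```python
import numpy as np
import pandas as pd

# Hypothetical sample: NaN marks an empty cell.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017"],
    "temperature": [32.0, np.nan, 28.0],
    "windspeed": [6.0, 7.0, np.nan],
})

# fillna(0) returns a new dataframe with every empty cell replaced by 0.
# The original df still contains its NaN values.
filled = df.fillna(0)
print(filled)
```

Note that **fillna()** gives back a new dataframe, so you need to store the result if you want to keep it.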

# 7-2: Dataframe Basics

**Dataframe** = the main object in Pandas. It is used to represent data with rows and columns (tabular data, like an Excel spreadsheet).

Today:
1. Creating dataframe
2. Dealing with rows and columns
3. Operations: min, max, std, describe
4. Conditional selection
5. set_index

In [4]:
import pandas as pd
df = pd.read_csv("7-2_weather_data.csv")
df

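|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |
| 5 | 1/6/2017 | 31 | 2 | Sunny |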


You can also create the df from a dictionary whose keys are the column names. Do **df = pd.DataFrame(weather_data)**. So, there are two ways to get a dataframe here: **read_csv()** or a dictionary.
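The dictionary itself is not shown in these notes, so here is a sketch of what **weather_data** could look like, assuming it holds the same six rows as the CSV output above (one key per column, one list of values per key).

```python
import pandas as pd

# Assumed contents of weather_data, copied from the table printed above.
weather_data = {
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
}

# Each dictionary key becomes a column; each list becomes that column's values.
df = pd.DataFrame(weather_data)
print(df)
```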

## Rows and columns

In [6]:
df.shape

(6, 4)

In [8]:
rows, columns = df.shape

In [9]:
rows

6

In [10]:
columns

4

"Shape" is the dimension. So **df.shape** shows you how many rows and columns you have. It is already a tuple, so you can unpack it into two variables with **rows, columns = df.shape**. Then, you can print **rows** and get 6.

In [11]:
df.head()

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |


In [12]:
df.head(2)

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |


**df.head()** prints the first 5 rows by default. If you have a lot of rows, you don't want to print the full dataframe. If you only want to print, say, 2 rows, you can do **df.head(2)**.

In [13]:
df.tail()

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |
| 5 | 1/6/2017 | 31 | 2 | Sunny |


In [14]:
df.tail(1)

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 5 | 1/6/2017 | 31 | 2 | Sunny |


**df.tail()** prints the last 5 rows. Again, **df.tail(1)** only shows the last 1 row.
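To try **head()** and **tail()** without the CSV file, here is a small self-contained sketch. It assumes the dataframe has the same six rows shown above and rebuilds it from a dictionary.

```python
import pandas as pd

# Same six rows as the weather table above, rebuilt so this runs on its own.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

first_five = df.head()   # default is 5 rows
first_two = df.head(2)   # only the first 2 rows
last_five = df.tail()    # default is 5 rows
last_one = df.tail(1)    # only the last row

print(first_two)
print(last_one)
```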

In [16]:
df[2:5]

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |


You can print a slice of the **df** with the slice **[:]** operator. Remember, the end of the slice is not included (that's why **df[2:5]** only prints rows 2 through 4).
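A quick sketch of the slice behaviour, again assuming the six-row weather table. The comparison with **iloc** is an addition that is not in the video; it is just another way to ask for rows by position.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

middle = df[2:5]             # rows at positions 2, 3 and 4 (5 is excluded)
same_rows = df.iloc[2:5]     # position-based selection, same result here

print(middle)
```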

In [17]:
df[:]

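|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |
| 5 | 1/6/2017 | 31 | 2 | Sunny |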


In [18]:
df

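|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |
| 5 | 1/6/2017 | 31 | 2 | Sunny |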


To print everything, you can do **df[:]** or just **df**.

In [19]:
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

Prints the names of the columns you have.
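If you want the column names as a plain list, or just the count, something like the following sketch should work. It assumes the same six-row weather table; the variable names are only for illustration.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

column_names = list(df.columns)   # plain Python list of the names
column_count = len(df.columns)    # how many columns there are

print(column_names)
print(column_count)
```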

In [20]:
df.day

0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object

In [21]:
df['event']

0     Rain
1    Sunny
2     Snow
3     Snow
4     Rain
5    Sunny
Name: event, dtype: object

Either the dot form (**df.day**) or the bracket form (**df['event']**) prints the contents of a specific column. The dot form is like accessing a property of an object, and the bracket form is like looking up a key in a dictionary.
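Here is a short sketch comparing the two access styles, assuming the six-row weather table. The renamed "wind speed" column is hypothetical; it is only there to show a name that the dot form cannot reach.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# Dot form and bracket form return the same column.
same = df.day.equals(df["day"])

# Hypothetical column name with a space: only the bracket form works here.
renamed = df.rename(columns={"windspeed": "wind speed"})
wind = renamed["wind speed"]

print(same)
print(wind)
```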

In [22]:
type(df['event'])

pandas.core.series.Series

For reference, this gives you the type of a single column (it's a pandas **Series**).

In [23]:
df[['event', 'day']]

|   | event | day |
| --- | --- | --- |
| 0 | Rain | 1/1/2017 |
| 1 | Sunny | 1/2/2017 |
| 2 | Snow | 1/3/2017 |
| 3 | Snow | 1/4/2017 |
| 4 | Rain | 1/5/2017 |
| 5 | Sunny | 1/6/2017 |


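Prints specific columns. Note the double brackets: you pass a list of column names, and the columns come back in the order you list them.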
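The difference between single and double brackets can be checked with a small sketch like this one (same six-row table assumed): one name gives a Series, a list of names gives a dataframe.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

one_column = df["event"]             # single brackets: a Series
two_columns = df[["event", "day"]]   # a list of names: a DataFrame

print(type(one_column))
print(type(two_columns))
```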

## Operations

Pandas has a ton of operations. You can find the full list in the pandas documentation.

In [25]:
df['temperature'].max()

35

**max()** gives you the maximum value in the column.

In [26]:
df['temperature'].mean()

30.333333333333332

**mean()** gives you the mean (average) of the data set.

In [27]:
df['temperature'].min()

24

**min()** gives you the minimum of the data set.

In [28]:
df['temperature'].std()

3.8297084310253524

**std()** gives you the standard deviation.

In [29]:
df.describe()

|   | temperature | windspeed |
| --- | --- | --- |
| count | 6.0 | 6.0 |
| mean | 30.333333 | 4.666667 |
| std | 3.829708 | 2.33809 |
| min | 24.0 | 2.0 |
| 25% | 28.75 | 2.5 |
| 50% | 31.5 | 5.0 |
| 75% | 32.0 | 6.75 |
| max | 35.0 | 7.0 |


**describe()** gives you summary statistics (count, mean, standard deviation, min, quartiles, max) for the columns that have numbers.
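The operations above can be tried together in one sketch, assuming the same six-row table. Picking a single number out of **describe()** with **loc[]** is an extra that is not in the video.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

t_max = df["temperature"].max()
t_min = df["temperature"].min()
t_mean = df["temperature"].mean()
t_std = df["temperature"].std()   # sample standard deviation

# describe() returns a dataframe, so you can pick one statistic out of it.
summary = df.describe()
median_temp = summary.loc["50%", "temperature"]

print(t_max, t_min, t_mean, t_std, median_temp)
```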

## Conditional selection

In [30]:
df[df.temperature>=32]

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 4 | 1/5/2017 | 32 | 4 | Rain |


In [31]:
df[df.temperature==df.temperature.max()]

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 1 | 1/2/2017 | 35 | 7 | Sunny |


In [32]:
df[df.temperature==df['temperature'].max()]

|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 1 | 1/2/2017 | 35 | 7 | Sunny |


**Example 1** (In [30]) shows how to select a conditional set of data (all rows where the temperature is greater than or equal to 32). **Examples 2 and 3** (In [31] and In [32]) are two variations of how to find the row with the maximum temperature. They don't just print the maximum value, they show the whole matching row from the dataframe. The bracket form in **Example 3** is particularly useful if your column names have spaces (for instance, if the "windspeed" column was actually "wind speed" with a space in between), because the dot form does not work for such names. Then you can put the column name in quotes inside the brackets.
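To see why this works, it can help to look at the condition on its own. In the sketch below (same six-row table assumed; **mask** is just an illustrative name) the comparison produces one True/False value per row, and the dataframe keeps the rows marked True.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# The comparison gives a boolean Series: True where the condition holds.
mask = df["temperature"] >= 32

# Indexing with the mask keeps only the rows marked True.
warm_days = df[mask]

print(mask)
print(warm_days)
```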

In [33]:
df['day'][df.temperature==df['temperature'].max()]

1    1/2/2017
Name: day, dtype: object

Here you print a specific column by putting **df['day']** in front of the condition. It prints only the day when the temperature was at its maximum.

In [37]:
df[['day','temperature']][df.temperature==df['temperature'].max()]

|   | day | temperature |
| --- | --- | --- |
| 1 | 1/2/2017 | 35 |


Here you print two columns, day and temperature, for the row where the temperature was at its maximum.
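The video selects the columns first and then applies the condition. As an alternative that is not shown in the video, **loc[]** can take the row condition and the column list in one step. A minimal sketch, assuming the same six-row table:

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# True only for the row(s) holding the maximum temperature.
is_hottest = df["temperature"] == df["temperature"].max()

# loc[rows, columns]: the condition picks rows, the list picks columns.
hottest = df.loc[is_hottest, ["day", "temperature"]]

print(hottest)
```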

## set_index

In [38]:
df

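|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |
| 5 | 1/6/2017 | 31 | 2 | Sunny |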


Here you can see that when you create the dataframe, pandas assigns integer indexes automatically (0, 1, 2, 3, etc.).

In [39]:
df.index

RangeIndex(start=0, stop=6, step=1)

Prints the index range (here it starts at 0 and ends at 5, because the stop value of 6 is not included).

In [40]:
df.set_index('day')

| day | temperature | windspeed | event |
| --- | --- | --- | --- |
| 1/1/2017 | 32 | 6 | Rain |
| 1/2/2017 | 35 | 7 | Sunny |
| 1/3/2017 | 28 | 2 | Snow |
| 1/4/2017 | 24 | 7 | Snow |
| 1/5/2017 | 32 | 4 | Rain |
| 1/6/2017 | 31 | 2 | Sunny |


You can change the index to another column, such as the **day** column, with **set_index**. Be careful though: this returns a new dataframe and does not change the original **df**...to modify the original, use...
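A small sketch of that "be careful" point, assuming the same six-row table: without **inplace=True**, **set_index** hands back a new dataframe and **df** keeps its 0 to 5 index. The name **by_day** is only for illustration.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# Returns a new dataframe indexed by day; df itself is untouched.
by_day = df.set_index("day")

print(df.index)       # still the automatic 0..5 index
print(by_day.index)   # the day values
```

Assigning the result to a new name is an alternative to **inplace=True** when you want to keep both versions around.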

In [43]:
df.set_index('day', inplace=True)

In [44]:
df

| day | temperature | windspeed | event |
| --- | --- | --- | --- |
| 1/1/2017 | 32 | 6 | Rain |
| 1/2/2017 | 35 | 7 | Sunny |
| 1/3/2017 | 28 | 2 | Snow |
| 1/4/2017 | 24 | 7 | Snow |
| 1/5/2017 | 32 | 4 | Rain |
| 1/6/2017 | 31 | 2 | Sunny |


Now you can use the actual date as an index. Such as...

In [45]:
df.loc['1/3/2017']

temperature      28
windspeed         2
event          Snow
Name: 1/3/2017, dtype: object

You can also reset back to the original index...

In [46]:
df.reset_index(inplace=True)

In [47]:
df

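|   | day | temperature | windspeed | event |
| --- | --- | --- | --- | --- |
| 0 | 1/1/2017 | 32 | 6 | Rain |
| 1 | 1/2/2017 | 35 | 7 | Sunny |
| 2 | 1/3/2017 | 28 | 2 | Snow |
| 3 | 1/4/2017 | 24 | 7 | Snow |
| 4 | 1/5/2017 | 32 | 4 | Rain |
| 5 | 1/6/2017 | 31 | 2 | Sunny |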


Okay, so let's set the index to the event.

In [48]:
df.set_index('event', inplace=True)
df

| event | day | temperature | windspeed |
| --- | --- | --- | --- |
| Rain | 1/1/2017 | 32 | 6 |
| Sunny | 1/2/2017 | 35 | 7 |
| Snow | 1/3/2017 | 28 | 2 |
| Snow | 1/4/2017 | 24 | 7 |
| Rain | 1/5/2017 | 32 | 4 |
| Sunny | 1/6/2017 | 31 | 2 |


In [49]:
df.loc['Snow']

| event | day | temperature | windspeed |
| --- | --- | --- | --- |
| Snow | 1/3/2017 | 28 | 2 |
| Snow | 1/4/2017 | 24 | 7 |


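You can get the rows whose index label matches a certain value with **loc[]**. Here the index is the event, so **df.loc['Snow']** returns every row where the event was Snow.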
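One detail worth seeing in code: what **loc[]** gives back depends on how many rows share the label. In this sketch (same six-row table assumed, with illustrative names **by_day** and **by_event**) a unique label returns a single row as a Series, while a repeated label returns a dataframe.

```python
import pandas as pd

# Same six rows as the weather table above.
df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/3/2017", "1/4/2017", "1/5/2017", "1/6/2017"],
    "temperature": [32, 35, 28, 24, 32, 31],
    "windspeed": [6, 7, 2, 7, 4, 2],
    "event": ["Rain", "Sunny", "Snow", "Snow", "Rain", "Sunny"],
})

# Each day appears once, so loc returns one row as a Series.
by_day = df.set_index("day")
one_day = by_day.loc["1/3/2017"]

# "Snow" appears twice, so loc returns a DataFrame with both rows.
by_event = df.set_index("event")
snow_days = by_event.loc["Snow"]

print(one_day)
print(snow_days)
```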