#### The DataFrame

The DataFrame is the main object in Pandas. It represents data in rows and columns (think Excel or other spreadsheet programs).

In this lecture, we will cover the following...

#### 1) Creating DataFrame

#### 2) Dealing with rows and columns

#### 3) Operations: min, max, std, describe

#### 4) set_index    

#### 1) Creating DataFrame

In [2]:
import pandas as pd # This is us importing the pandas module into Python

##### Creating a DataFrame by reading in a CSV file

In [8]:
df = pd.read_csv('D:\\data\\weather.csv') # This is us creating a dataframe by importing the information from a CSV file

In [9]:
df.head() 

Unnamed: 0,day	temperature	windspeed	event
0,1/1/2017\t32\t6\tRain
1,1/2/2017\t35\t7\tSunny
2,1/3/2017\t28\t2\tSnow
3,1/4/2017\t24\t7\tSnow
4,1/5/2017\t32\t4\tRain


This looks like a <TAB> separated file rather than a comma separated file, but luckily we know how to deal with that: we tell read_csv which separator to use.

In [10]:
df = pd.read_csv('D:\\data\\weather.csv', sep='\t') # Not sure why it is a <TAB> separated file

In [11]:
df.head() # Displays perfectly now

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
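The effect of the separator can be reproduced without the file on disk. This is a minimal sketch, assuming a short in-memory string stands in for weather.csv (the `raw` variable and its two data rows are illustrative, copied from the output above):

```python
import io
import pandas as pd

# Illustrative stand-in for the tab-separated weather.csv file
raw = (
    "day\ttemperature\twindspeed\tevent\n"
    "1/1/2017\t32\t6\tRain\n"
    "1/2/2017\t35\t7\tSunny\n"
)

# With the default comma separator, every line ends up in one single column
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# With sep='\t', the four columns are split as intended
right = pd.read_csv(io.StringIO(raw), sep='\t')
print(right.shape)
print(list(right.columns))
```

Comparing the two shapes is a quick way to spot a wrong separator before working with the data.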


##### Creating a DataFrame by using a Python dictionary/DataFrame constructor

In [13]:
weather_data = {
    'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
    'temperature': [32,35,28,24,32,31],
    'windspeed': [6,7,2,7,4,2],
    'event': ['Rain', 'Sunny', 'Snow','Snow','Rain', 'Sunny']
}

Each of the keys in the dictionary (day, temperature, windspeed and event) becomes a column in your df, and the list of values for each key fills that column, one value per row.

In [16]:
df02 = pd.DataFrame(weather_data) # This is the DataFrame constructor
df02

Unnamed: 0,day,event,temperature,windspeed
0,1/1/2017,Rain,32,6
1,1/2/2017,Sunny,35,7
2,1/3/2017,Snow,28,2
3,1/4/2017,Snow,24,7
4,1/5/2017,Rain,32,4
5,1/6/2017,Sunny,31,2
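In the output above the columns appear in alphabetical order rather than in the order of the dictionary keys; the exact behaviour can depend on the Pandas and Python versions in use. A minimal sketch, assuming a shortened copy of the dictionary (three rows only, for illustration), of how the columns argument of the constructor can be used to fix the order explicitly:

```python
import pandas as pd

# Shortened, illustrative copy of the weather_data dictionary
weather_data = {
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'temperature': [32, 35, 28],
    'windspeed': [6, 7, 2],
    'event': ['Rain', 'Sunny', 'Snow'],
}

# The columns argument selects the columns and sets their order
df_ordered = pd.DataFrame(
    weather_data,
    columns=['day', 'temperature', 'windspeed', 'event'],
)
print(list(df_ordered.columns))
print(df_ordered.shape)
```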


#### 2) Dealing with rows and columns
DataFrames are all about rows and columns

In [17]:
df02.shape # This tells us the dimensions: how many rows and columns our df has. (Think numpy array.)

(6, 4)

The result is a tuple of (rows, columns). When you want the number of rows and the number of columns separately, you can unpack it...

In [18]:
rows, columns = df.shape

In [19]:
rows

6

In [20]:
columns

4

In [21]:
df.head() # Prints just the first five rows of the df by default. 

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


If you only need the first two rows, pass head an argument n, where n is the number of rows that you want to display, e.g. df.head(2). This is useful for checking that your data has imported correctly without having to print out the entire df (which might be millions of rows).

The opposite of head is tail, which prints the last 5 rows of your df by default. Again, you can pass it an argument n.
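A minimal sketch of both calls, assuming a small stand-in frame (`df_demo` is illustrative and holds only two of the weather columns):

```python
import pandas as pd

# Illustrative stand-in for the weather DataFrame
df_demo = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
})

first_two = df_demo.head(2)  # first two rows
last_two = df_demo.tail(2)   # last two rows
print(first_two)
print(last_two)
```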

##### Slicing your data

In [22]:
df[2:5] # Will print out rows 2 up to, but not including, row 5

Unnamed: 0,day,temperature,windspeed,event
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


To print everything, you can either use df[:] or simply df.
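The bracket slice above selects rows by position. A minimal sketch, assuming a small stand-in frame (`df_demo` is illustrative), showing that .iloc expresses the same positional selection more explicitly:

```python
import pandas as pd

# Illustrative stand-in for the weather DataFrame
df_demo = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
})

by_brackets = df_demo[2:5]    # rows at positions 2, 3 and 4
by_iloc = df_demo.iloc[2:5]   # the same rows, selected explicitly by position
print(by_brackets.equals(by_iloc))
```

Using .iloc makes it clear to a reader that positions, not index labels, are meant.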

In [23]:
df.columns # Will list all of your column names; the result is an Index object, but the row index itself is not included

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [24]:
df.day # and df['day'] will print out just the values for the day column

0    1/1/2017
1    1/2/2017
2    1/3/2017
3    1/4/2017
4    1/5/2017
5    1/6/2017
Name: day, dtype: object

In [25]:
type(df['event']) # To get the type of the object that holds the event column

pandas.core.series.Series

We can see that pandas recognises the columns as a Pandas Series (all columns are).

In [27]:
type(df.event) # You can use the dot or bracket notation when selecting a column (Series)

pandas.core.series.Series

In [26]:
df[['event', 'day']] # To print however many columns you wish rather than all

Unnamed: 0,event,day
0,Rain,1/1/2017
1,Sunny,1/2/2017
2,Snow,1/3/2017
3,Snow,1/4/2017
4,Rain,1/5/2017
5,Sunny,1/6/2017
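The number of brackets decides what comes back. A minimal sketch, assuming a small stand-in frame (`df_demo` is illustrative): single brackets with one name give a Series, while double brackets (a list of names) give a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Illustrative stand-in for the weather DataFrame
df_demo = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'event': ['Rain', 'Sunny', 'Snow'],
})

as_series = df_demo['event']    # one column name -> Series
as_frame = df_demo[['event']]   # list of column names -> DataFrame
print(type(as_series).__name__)
print(type(as_frame).__name__)
```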


#### 3) Operations: min, max, std, describe
Finding information stored in a Pandas Series is pretty straightforward.

In [29]:
df['temperature'] # We are going to pull out the max temp from this series

0    32
1    35
2    28
3    24
4    32
5    31
Name: temperature, dtype: int64

In [28]:
df['temperature'].max() # To find the maximum temperature (or highest value) in the Series/Column

35

In [30]:
df['temperature'].min()

24

In [31]:
df['temperature'].mean()

30.333333333333332

In [32]:
df['temperature'].std()

3.8297084310253524
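By default, std() in Pandas computes the sample standard deviation (it divides by n - 1). A minimal sketch, assuming the six temperatures from above are held in a standalone Series (`temps` is an illustrative name), comparing it with the population version via the ddof argument:

```python
import pandas as pd

# The six temperature values from the weather data
temps = pd.Series([32, 35, 28, 24, 32, 31])

sample_std = temps.std()            # default ddof=1, divides by n - 1
population_std = temps.std(ddof=0)  # divides by n
print(sample_std)
print(population_std)
```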

In [34]:
df.describe()

Unnamed: 0,temperature,windspeed
count,6.0,6.0
mean,30.333333,4.666667
std,3.829708,2.33809
min,24.0,2.0
25%,28.75,2.5
50%,31.5,5.0
75%,32.0,6.75
max,35.0,7.0


.describe() prints statistics on your df, but by default only for the numerical columns.
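Non-numerical columns can be summarised as well. A minimal sketch, assuming a small stand-in frame (`df_demo` is illustrative), using the include argument of describe:

```python
import pandas as pd

# Illustrative stand-in for the weather DataFrame
df_demo = pd.DataFrame({
    'temperature': [32, 35, 28, 24, 32, 31],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# include='all' adds the text column, with rows such as count, unique, top and freq
summary = df_demo.describe(include='all')
print(summary)
```

Statistics that do not apply to a column (for example the mean of event) show up as NaN.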

##### Filtering your data

In [35]:
df[df.temperature >= 32] # This filters the df and returns all rows where the temp was greater than or equal to 32

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
4,1/5/2017,32,4,Rain


In [36]:
df[df['temperature'] == df['temperature'].max()] # This is to retrieve the row that contains the highest temp

Unnamed: 0,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny


Of course, we can also use dot notation to do this...

df[df.temperature == df.temperature.max()]

However, where the column name contains a space, you will have to use the bracket notation
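A minimal sketch of that case, assuming a hypothetical column called 'wind speed' (the frame `df_space` and its column names are illustrative, not part of the weather file):

```python
import pandas as pd

# Hypothetical frame whose first column name contains a space
df_space = pd.DataFrame({
    'wind speed': [6, 7, 2],
    'temperature': [32, 35, 28],
})

# df_space.wind speed would be a syntax error, so bracket notation is needed
fastest = df_space[df_space['wind speed'] == df_space['wind speed'].max()]
print(fastest)
```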

In [38]:
df['day'][df['temperature'] == df['temperature'].max()] # To find just the day when the temp was at the maximum

1    1/2/2017
Name: day, dtype: object

In [40]:
df[['day', 'temperature']][df['temperature'] == df['temperature'].max()] # To find the day and temp of the max temp in our df

Unnamed: 0,day,temperature
1,1/2/2017,35
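The two-step selections above (pick columns, then filter rows) can also be written as a single .loc call. A minimal sketch, assuming a small stand-in frame (`df_demo` and `mask` are illustrative names):

```python
import pandas as pd

# Illustrative stand-in for the weather DataFrame
df_demo = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
})

# Boolean mask marking the row(s) with the maximum temperature
mask = df_demo['temperature'] == df_demo['temperature'].max()

# .loc takes the row selection first and the column selection second
hottest = df_demo.loc[mask, ['day', 'temperature']]
print(hottest)
```

Selecting rows and columns in one step is usually easier to read than chaining two sets of brackets.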


#### 4) set_index

In [43]:
df02 # We can see the auto-generated Pandas Index is being used because we didn't specify one to be used

Unnamed: 0,day,event,temperature,windspeed
0,1/1/2017,Rain,32,6
1,1/2/2017,Sunny,35,7
2,1/3/2017,Snow,28,2
3,1/4/2017,Snow,24,7
4,1/5/2017,Rain,32,4
5,1/6/2017,Sunny,31,2


In [44]:
df02.index # Shows my index as a range starting at zero and stopping before six, in steps of one

RangeIndex(start=0, stop=6, step=1)

In [45]:
df02.set_index('day') # This sets our index, for our df, to what used to be the day column

day,event,temperature,windspeed
1/1/2017,Rain,32,6
1/2/2017,Sunny,35,7
1/3/2017,Snow,28,2
1/4/2017,Snow,24,7
1/5/2017,Rain,32,4
1/6/2017,Sunny,31,2


Using the command as it is above returns a copy of the original df, with our changes present, but it doesn't change the original df. This is done to protect your df from accidents. However, because the result was not saved, df02 still has its original integer index, so if you now run a command that relies on the new index, it will almost certainly throw an error.

df02.set_index('day', inplace = True) # We use inplace argument to make the change to the original df and avoid unexpected errors
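Assigning the returned copy to a name is another way to keep the re-indexed df. A minimal sketch, assuming a small stand-in frame (`df_demo`, `indexed` and `lookup_failed` are illustrative names), which also shows the error you get when the original is queried by a label it does not have:

```python
import pandas as pd

# Illustrative stand-in for df02
df_demo = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017'],
    'temperature': [32, 35, 28],
})

# set_index returns a new DataFrame; df_demo itself is left unchanged
indexed = df_demo.set_index('day')
print('day' in df_demo.columns)

# The original still has the integer index, so a date label is not found
try:
    df_demo.loc['1/3/2017']
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# The re-indexed copy can be queried by date
print(indexed.loc['1/3/2017', 'temperature'])
```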

In [50]:
df02.loc['1/3/2017']

event          Snow
temperature      28
windspeed         2
Name: 1/3/2017, dtype: object

The advantage of having a 'proper' index is that we can run specialised functions and get very specific pieces of information returned

In [51]:
df02.reset_index(inplace=True) # This restores the default integer index and turns day back into a column
df02

Unnamed: 0,day,event,temperature,windspeed
0,1/1/2017,Rain,32,6
1,1/2/2017,Sunny,35,7
2,1/3/2017,Snow,28,2
3,1/4/2017,Snow,24,7
4,1/5/2017,Rain,32,4
5,1/6/2017,Sunny,31,2


You can pretty much set any column as an index

In [53]:
df02.set_index('event', inplace = True)
df02

event,day,temperature,windspeed
Rain,1/1/2017,32,6
Sunny,1/2/2017,35,7
Snow,1/3/2017,28,2
Snow,1/4/2017,24,7
Rain,1/5/2017,32,4
Sunny,1/6/2017,31,2


The problem with this is that some entries are repeated, so if we wanted to print the rows with an index of Snow...

In [55]:
df02.loc['Snow'] # Return all rows where the index says snow

event,day,temperature,windspeed
Snow,1/3/2017,28,2
Snow,1/4/2017,24,7


This is not a problem for this df, and it actually works well here, but you might want an index with unique values. That is why numbers are used by default and why dates make such a good alternative, although dates can repeat too and so be subject to the same issue.
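Whether an index contains repeated values can be checked before relying on it. A minimal sketch, assuming a small stand-in frame (`df_demo`, `by_event` and `by_day` are illustrative names), using the is_unique attribute of the index:

```python
import pandas as pd

# Illustrative stand-in for the weather DataFrame
df_demo = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

by_event = df_demo.set_index('event')
by_day = df_demo.set_index('day')

print(by_event.index.is_unique)    # event labels repeat
print(by_day.index.is_unique)      # each day appears once
print(len(by_event.loc['Snow']))   # number of rows sharing the Snow label
```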