# Lab: Pandas and DataFrames

Pandas is the bread and butter of a data analyst or data scientist. We use pandas to load and transform data. Typically, pandas is used for most of the exploratory data analysis.

In this lab we will be working with several datasets.

Part 1 uses the Iris dataset and Part 2 uses the heart dataset. Later exercises use the grad students dataset and the Titanic dataset.

Let's start with the Iris dataset.

In [33]:
import pandas as pd
from sklearn.datasets import load_iris
import random
import numpy as np

# IRIS DATASET 
iris_dataset = load_iris()
target = list(iris_dataset.target) 
dataset = list(iris_dataset.data)
target_names = list(iris_dataset.target_names)
feature_names = iris_dataset.feature_names


# HEART DATASET 
heart_loc = "../../../data/heart.csv"


In [2]:
# Using the variable dataset, make a dataframe. Remember that
# when you have a list you need to do df = pd.DataFrame(data=list, columns=column_names)
# where column_names is a list 
iris_df = pd.DataFrame(data=dataset, columns=feature_names)

# Store the head of iris_df; by default, head() returns the first 5 rows of data 
iris_head = iris_df.head()

# You should get the exact values shown below 
print("The head of the iris dataset is \n{}".format(iris_head))

The head of the iris dataset is 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2




In [3]:
# Next create a dataframe for the variable target. This is similar to what you did above
target_df = pd.DataFrame(data=target, columns=['target'])

print("Head of target dataframe is : \n {}".format(target_df.head()))

Head of target dataframe is : 
    target
0       0
1       0
2       0
3       0
4       0



Next we want to combine the two dataframes `iris_df` and `target_df`. There are multiple ways of doing this in pandas; we will use the `concat` function here. You can read about it [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).
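As a rough illustration of how the `axis` argument of `concat` changes the result, here is a minimal sketch on two small, made-up dataframes (the names `features` and `labels` and their values are hypothetical, not part of the lab data):

```python
import pandas as pd

# Two small, made-up dataframes that share the same row index (0, 1, 2)
features = pd.DataFrame({"length": [5.1, 4.9, 4.7], "width": [3.5, 3.0, 3.2]})
labels = pd.DataFrame({"target": [0, 0, 1]})

# axis=1 places the frames side by side, aligning rows on the index
side_by_side = pd.concat([features, labels], axis=1)

# axis=0 (the default) stacks the frames vertically; columns that are
# missing from one frame are filled with NaN
stacked = pd.concat([features, labels], axis=0)

print(side_by_side.shape)  # (3, 3)
print(stacked.shape)       # (6, 3)
```

Because `concat` aligns on the index, combining along `axis=1` only lines rows up as expected when both frames carry matching index labels.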

In [4]:
# Use the concat function to combine iris_df and target_df. Make sure you check 
# the axis along which you are combining the dataframes 
iris_combined_df = pd.concat([iris_df, target_df], axis=1) 

print("Shape of the iris combine dataframe : {}".format(iris_combined_df.shape))

Shape of the iris combine dataframe : (150, 5)




In [5]:
# Next print the values in the 3rd row of the iris_combined_df 
third_row = iris_combined_df.iloc[2,:]

print("Third row values : \n{}".format(third_row))

Third row values : 
sepal length (cm)    4.7
sepal width (cm)     3.2
petal length (cm)    1.3
petal width (cm)     0.2
target               0.0
Name: 2, dtype: float64




In [6]:
# Suppose we want the third column in the dataset. Using iloc, can you get the third column?
third_column = iris_combined_df.iloc[:,2]
print("Print the head of the third column is : \n{}".format(third_column.head()))

Print the head of the third column is : 
0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal length (cm), dtype: float64



Now suppose you want to change a single value in the dataframe, i.e., the value in a single cell. How would you do that?
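One possible way to think about single-cell assignment is sketched below on a made-up dataframe (the frame `df`, its columns, and its index labels are hypothetical). It contrasts label-based access (`.loc`, `.at`) with position-based access (`.iloc`). With the default index 0, 1, 2, ... the label and the position of a row coincide, so label 3 refers to the fourth row. The cast to `object` is an assumption made for safety: recent pandas versions warn when a string is written into an integer column, and newer ones may reject it.

```python
import pandas as pd

# A made-up frame with non-default index labels to make the difference visible
df = pd.DataFrame({"value": [10, 20, 30], "label": [0, 1, 2]}, index=[100, 101, 102])

# Cast to object first so the column can hold both integers and strings
df["label"] = df["label"].astype(object)

# .loc selects by index label and column name
df.loc[101, "label"] = "b"

# .iloc selects by integer position (row 0 is the first row, column 0 is "value")
df.iloc[0, 0] = 11

# .at is a scalar-only variant of .loc
df.at[102, "value"] = 33

print(df)
```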

In [7]:
# Change the value of target in the row with index label 3 (the fourth row) of iris_combined_df from 0 to setosa 
iris_combined_df.loc[3,'target'] = 'setosa'

print("Row with index 3 of the iris dataset : \n{}".format(iris_combined_df.iloc[3,:]))

Row with index 3 of the iris dataset : 
sepal length (cm)       4.6
sepal width (cm)        3.1
petal length (cm)       1.5
petal width (cm)        0.2
target               setosa
Name: 3, dtype: object




Next up, let's see how we can apply an operation to an entire row or an entire column of a dataframe. The next part is a bit more challenging, since you will have to use a lambda function to modify the column. Our first task is to convert all the values in the target column to names. All the values in target are 0, 1, or 2, except in the row with index 3 (the fourth row), which we changed to 'setosa'. Hence we now have a column that contains the integer values 0, 1, 2 and the string 'setosa'.

To get around this, we will use the apply function on the target column. Take a look at it [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html). That page documents `DataFrame.apply`; a single column is a Series, whose `apply` method works element by element. Remember, we are applying this only to the target column, not to the rest of them. The argument of the apply function will be a lambda function like this:

lambda x: return names_dict[x] if the string form of x is a digit, otherwise return x

Now add this lambda function as the argument to apply. You can use `str(x).isdigit()` to check whether the string form of a value consists only of digits.
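The conditional lambda described above can be tried out on a small, made-up Series before applying it to the real column. This is only a sketch: the names `code_to_name` and `mixed` are hypothetical stand-ins for `names_dict` and the target column.

```python
import pandas as pd

# Hypothetical stand-in for names_dict
code_to_name = {0: "setosa", 1: "versicolor", 2: "virginica"}

# A column that mixes integer codes with an already converted name
mixed = pd.Series([0, 1, "setosa", 2], dtype=object)

# Look up integer codes in the dictionary and leave everything else untouched
named = mixed.apply(lambda x: code_to_name[x] if str(x).isdigit() else x)

print(named.tolist())  # ['setosa', 'versicolor', 'setosa', 'virginica']
```

Note that `str(x).isdigit()` is `False` for values such as `0.0` or `-1`, so this check only works as intended when the codes are non-negative integers.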

In [8]:
names_dict = {0: target_names[0], 1:target_names[1], 2: target_names[2]}

# Convert all the values in the target column to names.
# you can use the names in target_names for this purpose
iris_combined_df['target'] = iris_combined_df['target'].apply(lambda x: names_dict[x] if str(x).isdigit() else x)

print("Head of the target column : \n{}".format(iris_combined_df['target'].head()))

Head of the target column : 
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: target, dtype: object




The method above may not be the most elegant way of doing this. You may find other ways that are better or easier. Next, let's look at applying an operation to an entire row.

Suppose we want to take the square of all the elements in the first row. How would you do it? 
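As a sketch of the idea on made-up data (the frame `df` and its columns are hypothetical), the snippet below squares the numeric part of a row in two ways: element by element with `apply`, and with a vectorized power operation. The explicit `astype(float)` is there because a row taken from a frame with mixed column types may come back with `object` dtype.

```python
import pandas as pd

# A made-up frame with two numeric columns and one text column
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "name": ["x", "y"]})

# First row, first two columns only (the text column is left out)
first_row_numeric = df.iloc[0, 0:2]

# Option 1: apply a lambda to each element
squared_apply = first_row_numeric.apply(lambda x: x * x)

# Option 2: convert to float and use a vectorized operation
squared_vectorized = first_row_numeric.astype(float) ** 2

print(squared_apply.tolist())       # [1.0, 9.0]
print(squared_vectorized.tolist())  # [1.0, 9.0]
```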

In [31]:
iris_copy  = iris_combined_df.copy()

# Use iris copy and square the values of the first row
# In order to do this you need to use iloc to select
# the first row and 4 columns. To this you need to apply 
# a lambda function that will square the values
squared_values = iris_copy.iloc[0, 0:4].apply(lambda x : x*x)

print("Squared values of the first row are : \n{}".format(squared_values))

Squared values of the first row are : 
sepal length (cm)    26.01
sepal width (cm)     12.25
petal length (cm)     1.96
petal width (cm)      0.04
Name: 0, dtype: float64



## Part 2  
Next up we will be using the heart dataset. Details on the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/heart+disease). The goal of the dataset is to predict whether a person has heart disease; 13 features are available for this purpose. These features are a mix of continuous and categorical variables. Here we will explore the data using pandas.

Let's start by importing the data.

In [35]:
# Import the heart dataset by reading the csv file. the variable heart_loc contains path to the file
heart_df = pd.read_csv(heart_loc)

heart_head = heart_df.head()

print("Head of the heart dataset: \n{}".format(heart_head))

Head of the heart dataset: 
   Unnamed: 0  Age  Sex     ChestPain  RestBP  Chol  Fbs  RestECG  MaxHR  \
0           1   63    1       typical     145   233    1        2    150   
1           2   67    1  asymptomatic     160   286    0        2    108   
2           3   67    1  asymptomatic     120   229    0        2    129   
3           4   37    1    nonanginal     130   250    0        0    187   
4           5   41    0    nontypical     130   204    0        2    172   

   ExAng  Oldpeak  Slope   Ca        Thal  AHD  
0      0      2.3      3  0.0       fixed   No  
1      1      1.5      2  3.0      normal  Yes  
2      1      2.6      2  2.0  reversable  Yes  
3      0      3.5      3  0.0      normal   No  
4      0      1.4      1  0.0      normal   No  
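An `Unnamed: 0` column like the one in the output above typically appears when a CSV file was saved together with its row index. Assuming that is what happened here (this cannot be confirmed from the output alone), a minimal sketch of how `index_col` changes the result looks as follows. The CSV text is made up and read from memory rather than from `heart_loc`.

```python
import io

import pandas as pd

# A tiny, made-up CSV whose first header cell is empty (a saved row index)
csv_text = ",Age,Sex\n1,63,1\n2,67,1\n3,37,0\n"

# Default behaviour: the nameless first column becomes "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))  # ['Unnamed: 0', 'Age', 'Sex']
print(list(indexed_df.columns))  # ['Age', 'Sex']
```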



When you are dealing with a categorical variable, it is a good idea to get a count of how many rows of data you have for each category. You can easily do this with the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) function in pandas, which you apply to a single column of the dataset.
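A small sketch of `value_counts` on a made-up column (the Series `pain` and its values are hypothetical, not taken from the heart dataset) shows both the raw counts and, with `normalize=True`, the share of each category:

```python
import pandas as pd

# A made-up categorical column
pain = pd.Series(["typical", "asymptomatic", "asymptomatic", "nonanginal", "asymptomatic"])

# Number of rows per category, sorted from most to least frequent
counts = pain.value_counts()

# Fraction of rows per category instead of raw counts
shares = pain.value_counts(normalize=True)

print(counts["asymptomatic"])  # 3
print(shares["asymptomatic"])  # 0.6
```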

In [38]:
# Use value_counts to get the number of rows for each category in the ChestPain column of the heart dataset. 
ChestPain_counts = heart_df['ChestPain'].value_counts()

print("Value counts of Chest Pain variable : \n{}".format(ChestPain_counts))

Value counts of Chest Pain variable : 
asymptomatic    144
nonanginal       86
nontypical       50
typical          23
Name: ChestPain, dtype: int64



In [40]:
# Suppose we want just the different categories and not the counts. How can we extract them? 
# Hint: You can get the index of ChestPain_counts and convert it to a list
chestpain_categories = list(ChestPain_counts.index) 

print("List of categories in the Categorical variable ChestPain : \n{}".format(chestpain_categories))

List of categories in the Categorical variable ChestPain : 
['asymptomatic', 'nonanginal', 'nontypical', 'typical']




In [41]:
# There may be situations where you want to convert the value counts to a dictionary

{'asymptomatic': 144, 'nonanginal': 86, 'nontypical': 50, 'typical': 23}
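The code that produced the dictionary above is not shown. One way to obtain a dictionary of this shape, assuming the intent was to convert the value counts, is `Series.to_dict()`. The sketch below rebuilds the counts by hand from the numbers printed earlier instead of reading the heart dataset:

```python
import pandas as pd

# Counts rebuilt by hand from the value_counts output shown earlier
counts = pd.Series({"asymptomatic": 144, "nonanginal": 86, "nontypical": 50, "typical": 23})

# to_dict() maps each index label (category) to its value (count)
counts_dict = counts.to_dict()

print(counts_dict)
```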

## Simplify the Grad Students Dataset

Given a dataset about graduate students, their college majors, and other statistics such as the number of graduates and the number employed, write Python code to extract the following information from the dataset and write it to an output file, "grad_simplified.csv". Use the pandas library (DataFrame) for this purpose.

The criteria for output extraction are:

1. Only include the columns: Major_code, Major, Major_category, Grad_employed and Grad_unemployed
2. Only include the majors that have either 'ENGINEERING', 'TECHNOLOGIES' or 'SCIENCE' (case insensitive) in the Major field.
3. Exclude any row whose Major_category has 'Social Science' or 'Arts' in its name.
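The three criteria above can be sketched on a tiny, made-up dataframe before running them on the real data. Everything in `toy` is hypothetical; only the column names follow the dataset. The sketch assumes that the exclusion in criterion 3 is case sensitive, since the text does not say otherwise.

```python
import pandas as pd

# A made-up frame that mimics the relevant columns of the dataset
toy = pd.DataFrame({
    "Major_code": [1, 2, 3, 4],
    "Major": ["COMPUTER SCIENCE", "POLITICAL SCIENCE", "ART HISTORY", "CIVIL ENGINEERING"],
    "Major_category": ["Computers & Mathematics", "Social Science", "Arts", "Engineering"],
    "Grad_employed": [100, 80, 40, 90],
    "Grad_unemployed": [5, 6, 7, 4],
    "Grad_total": [110, 90, 50, 100],
})

# Criterion 1: the columns to keep
columns = ["Major_code", "Major", "Major_category", "Grad_employed", "Grad_unemployed"]

# Criterion 2: case-insensitive match on any of the three keywords
keep_major = toy["Major"].str.contains("ENGINEERING|TECHNOLOGIES|SCIENCE", case=False, regex=True)

# Criterion 3: categories to exclude (assumed to be case sensitive)
drop_category = toy["Major_category"].str.contains("Social Science|Arts", regex=True)

# Combine the row filters and select the columns in one step
simplified = toy.loc[keep_major & ~drop_category, columns]

# Without a path, to_csv returns the CSV text; pass a filename to write a file
csv_text = simplified.to_csv(index=False)
print(csv_text)
```

Because `str.contains` matches substrings, the pattern 'Arts' also excludes categories such as 'Industrial Arts & Consumer Services'. Whether that is intended is a judgment call the criteria leave open.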




In [9]:
# Import the library
import pandas as pd

grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

# Write your code here or in the code cells below

In [10]:

grad_students

Unnamed: 0,Major_code,Major,Major_category,Grad_total,Grad_sample_size,Grad_employed,Grad_full_time_year_round,Grad_unemployed,Grad_unemployment_rate,Grad_median,...,Nongrad_total,Nongrad_employed,Nongrad_full_time_year_round,Nongrad_unemployed,Nongrad_unemployment_rate,Nongrad_median,Nongrad_P25,Nongrad_P75,Grad_share,Grad_premium
0,5601,CONSTRUCTION SERVICES,Industrial Arts & Consumer Services,9173,200,7098,6511,681,0.087543,75000.0,...,86062,73607,62435,3928,0.050661,65000.0,47000,98000.0,0.096320,0.153846
1,6004,COMMERCIAL ART AND GRAPHIC DESIGN,Arts,53864,882,40492,29553,2482,0.057756,60000.0,...,461977,347166,250596,25484,0.068386,48000.0,34000,71000.0,0.104420,0.250000
2,6211,HOSPITALITY MANAGEMENT,Business,24417,437,18368,14784,1465,0.073867,65000.0,...,179335,145597,113579,7409,0.048423,50000.0,35000,75000.0,0.119837,0.300000
3,2201,COSMETOLOGY SERVICES AND CULINARY ARTS,Industrial Arts & Consumer Services,5411,72,3590,2701,316,0.080901,47000.0,...,37575,29738,23249,1661,0.052900,41600.0,29000,60000.0,0.125878,0.129808
4,2001,COMMUNICATION TECHNOLOGIES,Computers & Mathematics,9109,171,7512,5622,466,0.058411,57000.0,...,53819,43163,34231,3389,0.072800,52000.0,36000,78000.0,0.144753,0.096154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,5203,COUNSELING PSYCHOLOGY,Psychology & Social Work,51812,724,38468,28808,1420,0.035600,50000.0,...,16781,12377,8502,835,0.063200,40000.0,25000,50000.0,0.755354,0.250000
169,5202,CLINICAL PSYCHOLOGY,Psychology & Social Work,22716,355,16612,12022,782,0.044958,70000.0,...,6519,4368,3033,357,0.075556,46000.0,30000,70000.0,0.777014,0.521739
170,6106,HEALTH AND MEDICAL PREPARATORY PROGRAMS,Health,114971,1766,78132,58825,1732,0.021687,135000.0,...,26320,16221,12185,1012,0.058725,51000.0,35000,87000.0,0.813718,1.647059
171,2303,SCHOOL STUDENT COUNSELING,Education,19841,260,11313,8130,613,0.051400,56000.0,...,2232,1328,980,169,0.112892,42000.0,27000,51000.0,0.898881,0.333333
