# Lab: Pandas and DataFrames

Pandas is the bread and butter of data analysts and data scientists. We use pandas to load and transform data, and it is typically the tool used for exploratory data analysis.

In this lab we will be working on multiple datasets. 

In Part 1 we will use the Iris dataset, in Part 2 the grad student dataset, and in Part 3 the Titanic dataset.

Let's start with the Iris dataset.

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
import random
import numpy as np


iris_dataset = load_iris()
target = list(iris_dataset.target) 
dataset = list(iris_dataset.data)
target_names = list(iris_dataset.target_names)
feature_names = iris_dataset.feature_names


In [2]:
# Using the variable dataset, make a dataframe. Remember that when
# you have a list you need to do df = pd.DataFrame(data=list, columns=column_names),
# where column_names is a list of column labels
iris_df = pd.DataFrame(data=dataset, columns=feature_names)

# Store the head of iris_df; by default, head() returns the first 5 rows of data
iris_head = iris_df.head()

# You should get the exact values shown below
print("The head of the iris dataset is \n{}".format(iris_head))

The head of the iris dataset is 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
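The constructor call used above can also be tried on a tiny hand-made list. The following is a minimal sketch; the values and the column names `length` and `width` are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical measurements: each inner list becomes one row.
rows = [
    [5.1, 3.5],
    [4.9, 3.0],
    [4.7, 3.2],
]
column_names = ["length", "width"]  # illustrative labels, not from the lab

small_df = pd.DataFrame(data=rows, columns=column_names)

print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.columns))  # column labels in the order they were given
print(small_df.head(2))        # head(n) returns only the first n rows
```

Passing a number to `head` is a convenient way to peek at a large dataframe without printing all of it.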



In [3]:
# Next create a dataframe for the variable target. This is similar to what you did above
target_df = pd.DataFrame(data=target, columns=['target'])

print("Head of target dataframe is : \n {}".format(target_df.head()))

Head of target dataframe is : 
    target
0       0
1       0
2       0
3       0
4       0



Next we want to combine the two dataframes `iris_df` and `target_df`. There are multiple ways of doing this in pandas; we will use the `concat` function here. You can read about it [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).
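To see why the axis matters, here is a small sketch with two made-up single-column dataframes (`left` and `right` are hypothetical names used only for this illustration). With `axis=1` the frames are placed side by side and aligned on their row index; with `axis=0` they are stacked on top of each other.

```python
import pandas as pd

# Two toy dataframes with the same row index (0, 1, 2) but different columns.
left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"b": [10, 20, 30]})

# axis=1: columns are added next to each other, rows are matched by index.
side_by_side = pd.concat([left, right], axis=1)

# axis=0: rows are appended; columns missing from a frame are filled with NaN.
stacked = pd.concat([left, right], axis=0)

print(side_by_side.shape)
print(stacked.shape)
print(stacked)
```

If the combined shape has more rows than either input, the frames were most likely stacked along the wrong axis.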

In [4]:
# Use the concat function to combine iris_df and target_df. Make sure you check
# the axis along which you are combining the dataframes
iris_combined_df = pd.concat([iris_df, target_df], axis=1) 

print("Shape of the iris combine dataframe : {}".format(iris_combined_df.shape))

Shape of the iris combine dataframe : (150, 5)



In [5]:
# Next print the values in the 3rd row of the iris_combined_df 
third_row = iris_combined_df.iloc[2,:]

print("Third row values : \n{}".format(third_row))

Third row values : 
sepal length (cm)    4.7
sepal width (cm)     3.2
petal length (cm)    1.3
petal width (cm)     0.2
target               0.0
Name: 2, dtype: float64




In [6]:
# Suppose we want the third column in the dataset. Using iloc, can you get the third column?
third_column = iris_combined_df.iloc[:,2]
print("Print the head of the third column is : \n{}".format(third_column.head()))

Print the head of the third column is : 
0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal length (cm), dtype: float64
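The difference between position-based and label-based selection is easier to see when the row labels are not integers. The sketch below uses a made-up dataframe whose index labels (`r1`, `r2`, `r3`) are purely illustrative.

```python
import pandas as pd

# Toy dataframe with string row labels so positions and labels cannot be confused.
df = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["r1", "r2", "r3"],
)

# iloc selects by integer position: row 1 is the second row.
second_row_by_position = df.iloc[1, :]

# loc selects by label: "r2" is the label of that same row.
second_row_by_label = df.loc["r2", :]

# A colon in the row slot keeps every row; 0 picks the first column.
first_column = df.iloc[:, 0]

print(second_row_by_position)
print(first_column)
```

A row that spans integer and float columns comes back as a single float64 Series, which is why an integer such as the target value can print as `0.0` when a whole row is selected.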



Now suppose you want to change a single value in the dataframe, i.e., the value in a single cell. How would you do that?

In [7]:
# Change the value of target in the row labeled 3 (the fourth row, since labels start at 0)
# of iris_combined_df from 0 to 'setosa'
iris_combined_df.loc[3,'target'] = 'setosa'

print("Print the third row of iris dataset : \n{}".format(iris_combined_df.iloc[3,:]))

Print the third row of iris dataset : 
sepal length (cm)       4.6
sepal width (cm)        3.1
petal length (cm)       1.5
petal width (cm)        0.2
target               setosa
Name: 3, dtype: object
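Two details of the assignment above are worth a closer look: `loc[3, ...]` addresses the row whose label is 3, which is the fourth row when labels start at 0, and writing a string into an integer column changes what the column can hold. The following is a hedged sketch on made-up data; casting the column to `object` first is an assumption of this example (not something the lab requires), added because recent pandas versions may warn when a value of a different type is written into a numeric column.

```python
import pandas as pd

# Toy dataframe with the default integer index 0..3.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0], "target": [0, 0, 1, 1]})

# Cast first so the column may hold both integers and strings.
df["target"] = df["target"].astype(object)

# Label 3 is the fourth row because the default labels start at 0.
df.loc[3, "target"] = "setosa"

print(df)
print(df["target"].dtype)  # object: the column now mixes integers and a string
```

The same cell can be read back by position with `df.iloc[3, 1]`, since `target` is the second column.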




Next up, let's see how we can apply an operation to an entire row or an entire column of a dataframe. The next part is a bit more challenging, since you will have to use a lambda function to modify the column. Our first task is to convert all the values in the target column to names. All the values in target are 0, 1, or 2, except in the row with index 3, which we changed to 'setosa'. Hence we now have a column that contains the integer values 0, 1, 2 and the string 'setosa'.

To get around this, we will use the apply function on the target column. Take a look at it [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html). Remember, we are applying this only to the target column, not to the rest of them. The argument of the apply function will be a lambda function like this:

lambda x: use names_dict[x] if the string form of x is a digit, otherwise return x

Now, you will add this lambda function as an argument into apply. You can use isdigit() to check if the string is a number.
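The pattern described above can be tried on a small mixed column first. This is a minimal sketch: the `labels` dictionary and the `mixed` Series are made up for illustration. Note that calling `apply` on a single column invokes the Series version of `apply`, which calls the function once per value.

```python
import pandas as pd

# Hypothetical lookup table from integer codes to names.
labels = {0: "low", 1: "high"}

# A column that mixes integer codes with an already converted string.
mixed = pd.Series([0, 1, "high", 0], dtype=object)

# str(x).isdigit() is True for the integer codes and False for the string "high",
# so only the codes are looked up and the string is passed through unchanged.
converted = mixed.apply(lambda x: labels[x] if str(x).isdigit() else x)

print(converted)
```

The `isdigit()` check is what keeps the lookup from failing on values that are already names, because those values are not keys of the dictionary.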

In [14]:
names_dict = {0: target_names[0], 1:target_names[1], 2: target_names[2]}

# Convert all the values in the target column to names.
# you can use the names in target_names for this purpose
iris_combined_df['target'] = iris_combined_df['target'].apply(lambda x: names_dict[x] if str(x).isdigit() else x)

print("Head of the target column : \n{}".format(iris_combined_df['target'].head()))

Head of the target column : 
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: target, dtype: object




## Simplify the Grad Students Dataset

Given a dataset about graduate students, their college majors, and other statistics such as the number of graduates and the number employed, write Python code to extract the following information from the dataset and write it to an output file, "grad_simplified.csv". Use the pandas library (DataFrame) for this purpose.

The criteria for output extraction are:

1. Only include the columns Major_Code, Major, Major_Category, Grad_employed, and Grad_unemployed.
2. Only include the majors that have 'ENGINEERING', 'TECHNOLOGIES', or 'SCIENCE' (case insensitive) in the Major field.
3. Exclude any row whose Major_category contains 'Social Science' or 'Arts'.
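One possible way to express the three criteria is sketched below on a handful of made-up rows rather than the real file. The sample values, the helper name `simplify_grad`, and the exact column spellings (`Major_code`, `Major_category`) are assumptions of this sketch; the helper looks columns up case-insensitively so that a different capitalization in the real CSV would still match. The category match is case-sensitive here, since the criteria do not say otherwise.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real dataset.
sample = pd.DataFrame(
    {
        "Major_code": [1, 2, 3, 4, 5],
        "Major": [
            "GENERAL ENGINEERING",
            "POLITICAL SCIENCE",
            "FINE ARTS",
            "COMPUTER SCIENCE",
            "HISTORY",
        ],
        "Major_category": [
            "Engineering",
            "Social Science",
            "Arts",
            "Computers & Mathematics",
            "Humanities & Liberal Arts",
        ],
        "Grad_total": [200, 150, 90, 220, 130],
        "Grad_employed": [100, 80, 50, 120, 70],
        "Grad_unemployed": [5, 4, 6, 3, 7],
    }
)


def simplify_grad(df):
    """Hypothetical helper applying the three criteria listed above."""
    # Map lower-cased names to the actual column names (case-insensitive lookup).
    by_lower = {name.lower(): name for name in df.columns}
    wanted = ["major_code", "major", "major_category",
              "grad_employed", "grad_unemployed"]
    columns = [by_lower[name] for name in wanted]

    major = df[by_lower["major"]]
    category = df[by_lower["major_category"]]

    # Criterion 2: keyword anywhere in the major name, ignoring case.
    keep = major.str.contains("ENGINEERING|TECHNOLOGIES|SCIENCE",
                              case=False, na=False)
    # Criterion 3: category mentions 'Social Science' or 'Arts'.
    drop = category.str.contains("Social Science|Arts", na=False)

    # Criterion 1: keep only the requested columns.
    return df.loc[keep & ~drop, columns]


simplified = simplify_grad(sample)

# Without a path, to_csv returns the CSV text; passing "grad_simplified.csv"
# as the first argument would write the file instead.
csv_text = simplified.to_csv(index=False)
print(csv_text)
```

In this sample, "POLITICAL SCIENCE" passes the keyword test but is removed by its category, which shows why the two filters have to be combined rather than applied as alternatives.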




In [None]:
# Import the library
import pandas as pd

grad_url = "https://raw.githubusercontent.com/colaberry/538data/master/college-majors/grad-students.csv"
grad_students = pd.read_csv(grad_url)

# Write your code here or in the code cells below

In [None]:

grad_students