# Exercises_05 - Pandas exercises

This week we will work with the `pandas` library for data analysis.
The reference guide for pandas can be found here: https://pandas.pydata.org/docs/.

## Revisiting the BMI dataset

Last week we used numpy to load a BMI dataset (from Kaggle) and displayed it in text form.
Although this does populate the numpy array with the data, the output is hard to read: every value carries a `b` (bytes) prefix and the columns are misaligned.


In [31]:
import numpy as np
import os

# Import `height-weight` keeping the text column intact.
url = os.path.join('..', 'datasets', '500_Person_Gender_Height_Weight_Index.csv')
hw_dataset = np.genfromtxt(url, delimiter=',', names=True, dtype='object')
hw_dataset

array([(b'Male', b'174', b'96', b'4'), (b'Male', b'189', b'87', b'2'),
       (b'Female', b'185', b'110', b'4'),
       (b'Female', b'195', b'104', b'3'), (b'Male', b'149', b'61', b'3'),
       (b'Male', b'189', b'104', b'3'), (b'Male', b'147', b'92', b'5'),
       (b'Male', b'154', b'111', b'5'), (b'Male', b'174', b'90', b'3'),
       (b'Female', b'169', b'103', b'4'), (b'Male', b'195', b'81', b'2'),
       (b'Female', b'159', b'80', b'4'),
       (b'Female', b'192', b'101', b'3'), (b'Male', b'155', b'51', b'2'),
       (b'Male', b'191', b'79', b'2'), (b'Female', b'153', b'107', b'5'),
       (b'Female', b'157', b'110', b'5'), (b'Male', b'140', b'129', b'5'),
       (b'Male', b'144', b'145', b'5'), (b'Male', b'172', b'139', b'5'),
       (b'Male', b'157', b'110', b'5'), (b'Female', b'153', b'149', b'5'),
       (b'Female', b'169', b'97', b'4'), (b'Male', b'185', b'139', b'5'),
       (b'Female', b'172', b'67', b'2'), (b'Female', b'151', b'64', b'3'),
       (b'Male', b'190', b'95', b'
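
The `b` prefix appears because every field was read as a bytes object. As a minimal sketch, assuming a numpy version in which `genfromtxt` accepts the `encoding` argument, passing `dtype=None` together with `encoding='utf-8'` lets numpy infer a type per column and return ordinary strings. The two inline rows are copied from the output above and stand in for the CSV file, so the snippet runs without the dataset; the names `csv_text` and `hw_sample` are illustrative only.

```python
import io
import numpy as np

# Two rows copied from the output above stand in for the real CSV file.
csv_text = "Gender,Height,Weight,Index\nMale,174,96,4\nMale,189,87,2\n"

# dtype=None asks numpy to infer a type for each column;
# encoding='utf-8' makes text columns come back as str instead of bytes.
hw_sample = np.genfromtxt(
    io.StringIO(csv_text), delimiter=',', names=True, dtype=None, encoding='utf-8'
)

print(hw_sample.dtype.names)   # column names taken from the header row
print(hw_sample['Height'])     # numeric column, usable in arithmetic
print(hw_sample['Gender'])     # text column without the b prefix
```

This removes the prefix, but the result is still a structured array rather than an aligned table.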

# Exercise 1
Now we are going to utilise pandas' methods to represent this data in a data frame (df). By default the formatting of the output will be easier to read. We can also utilise pandas methods to perform data analysis.

Start by running the code below to check that the `bmi_df` DataFrame has been populated by reading the CSV file.

In [37]:
import pandas as pd
bmi_df = pd.read_csv("../datasets/500_Person_Gender_Height_Weight_Index.csv")
bmi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Gender  500 non-null    object
 1   Height  500 non-null    int64 
 2   Weight  500 non-null    int64 
 3   Index   500 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 15.8+ KB


Now utilise pandas' sample method to show only five rows of the BMI data set.
Play around with the parameters of the sample method, and observe the results. 

From the documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html):


DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

In [40]:
bmi_df.sample(4)

|     | Gender | Height | Weight | Index |
|-----|--------|--------|--------|-------|
| 441 | Male   | 182    | 73     | 2     |
| 131 | Female | 187    | 70     | 2     |
| 313 | Female | 179    | 67     | 2     |
| 378 | Female | 154    | 96     | 5     |
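
To make the effect of the `sample` parameters easier to see, here is a small sketch on a made-up eight-row frame (the values in `toy_df` are invented for illustration and are not taken from the BMI dataset). It assumes a pandas version recent enough to support `ignore_index`, which appears in the signature quoted above.

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the parameters.
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 4,
    "Height": [150, 155, 160, 165, 170, 175, 180, 185],
})

# n: an exact number of rows; random_state makes the draw repeatable.
five_rows = toy_df.sample(n=5, random_state=0)

# frac: a fraction of the rows instead of a count (half of 8 rows is 4).
half_rows = toy_df.sample(frac=0.5, random_state=0)

# replace=True allows the same row to be drawn more than once,
# so n may exceed the number of rows in the frame.
with_repeats = toy_df.sample(n=12, replace=True, random_state=0)

# ignore_index=True relabels the result 0, 1, 2, ... instead of
# keeping the original row labels.
relabelled = toy_df.sample(n=3, random_state=0, ignore_index=True)

print(len(five_rows), len(half_rows), len(with_repeats))
print(relabelled.index.tolist())
```

Running the same call twice with the same `random_state` should return the same rows, which is handy when comparing results with a classmate.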


# Exercise 2 - Titanic dataset

In [2]:
titanic_df = pd.read_csv("../datasets/titanic-dataset.csv")

In [3]:
import os


In [4]:
os.path.abspath(os.path.join("..", "..", "..", "GitHub/PP4DSP1-test/datasets", "titanic-dataset.csv"))

'/Users/nick/Documents/GitHub/PP4DSP1-test/datasets/titanic-dataset.csv'

In [6]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
titanic_df["Age Filled"] = titanic_df.Age.fillna(titanic_df.Age.median())

In [15]:
titanic_df.sample(10)

|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age Filled |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------------|
| 530 | 531 | 1 | 2 | Quick, Miss. Phyllis May | female | 2.0 | 1 | 1 | 26360 | 26.0 | NaN | S | 2.0 |
| 591 | 592 | 1 | 1 | Stephenson, Mrs. Walter Bertram (Martha Eustis) | female | 52.0 | 1 | 0 | 36947 | 78.2667 | D20 | C | 52.0 |
| 129 | 130 | 0 | 3 | Ekstrom, Mr. Johan | male | 45.0 | 0 | 0 | 347061 | 6.975 | NaN | S | 45.0 |
| 319 | 320 | 1 | 1 | Spedden, Mrs. Frederic Oakley (Margaretta Corn... | female | 40.0 | 1 | 1 | 16966 | 134.5 | E34 | C | 40.0 |
| 475 | 476 | 0 | 1 | Clifford, Mr. George Quincy | male | NaN | 0 | 0 | 110465 | 52.0 | A14 | S | 28.0 |
| 84 | 85 | 1 | 2 | Ilett, Miss. Bertha | female | 17.0 | 0 | 0 | SO/C 14885 | 10.5 | NaN | S | 17.0 |
| 748 | 749 | 0 | 1 | Marvin, Mr. Daniel Warner | male | 19.0 | 1 | 0 | 113773 | 53.1 | D30 | S | 19.0 |
| 870 | 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26.0 | 0 | 0 | 349248 | 7.8958 | NaN | S | 26.0 |
| 496 | 497 | 1 | 1 | Eustis, Miss. Elizabeth Mussey | female | 54.0 | 1 | 0 | 36947 | 78.2667 | D20 | C | 54.0 |
| 656 | 657 | 0 | 3 | Radeff, Mr. Alexander | male | NaN | 0 | 0 | 349223 | 7.8958 | NaN | S | 28.0 |


In [16]:
titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Age Filled       0
dtype: int64
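
The counts above show `Age` with 177 missing values and `Age Filled` with none, which is the effect of the `fillna` call with the median. A minimal sketch of the same pattern on made-up data (the names and ages in `toy_df` are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with two missing ages.
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [22.0, None, 38.0, None, 26.0],
})

# median() skips missing values by default, so it is computed
# from the three known ages only.
age_median = toy_df["Age"].median()

# fillna returns a new Series; the original Age column is left unchanged.
toy_df["Age Filled"] = toy_df["Age"].fillna(age_median)

# Count the missing values per column, as done for the Titanic frame.
missing = toy_df.isna().sum()

print(age_median)
print(missing)
```

Writing the result to a new column keeps the original `Age` values available, so the filled and unfilled versions can still be compared later.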

In [19]:
first_t = titanic_df[["Survived", "Pclass"]]
middle_t = titanic_df[["Parch","Cabin"]]
end_t = titanic_df[["Name", "Age"]]

In [20]:
middle_t

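|     | Parch | Cabin |
|-----|-------|-------|
| 0   | 0     | NaN   |
| 1   | 0     | C85   |
| 2   | 0     | NaN   |
| 3   | 0     | C123  |
| 4   | 0     | NaN   |
| ... | ...   | ...   |
| 886 | 0     | NaN   |
| 887 | 0     | B42   |
| 888 | 2     | NaN   |
| 889 | 0     | C148  |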


In [29]:
pd.concat([first_t, end_t], axis=0)

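|     | Survived | Pclass | Name | Age |
|-----|----------|--------|------|-----|
| 0   | 0.0 | 3.0 | NaN | NaN |
| 1   | 1.0 | 1.0 | NaN | NaN |
| 2   | 1.0 | 3.0 | NaN | NaN |
| 3   | 1.0 | 1.0 | NaN | NaN |
| 4   | 0.0 | 3.0 | NaN | NaN |
| ... | ... | ... | ... | ... |
| 886 | NaN | NaN | Montvila, Rev. Juozas | 27.0 |
| 887 | NaN | NaN | Graham, Miss. Margaret Edith | 19.0 |
| 888 | NaN | NaN | Johnston, Miss. Catherine Helen "Carrie" | NaN |
| 889 | NaN | NaN | Behr, Mr. Karl Howell | 26.0 |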
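
The missing values in this output come from `axis=0`, which stacks the two frames on top of each other: each frame lacks the other's columns, so those cells are filled with NaN. A short sketch on made-up data (the values in `toy_df` and the names `first_part` and `end_part` are illustrative only) contrasts this with `axis=1`, which places the frames side by side and aligns them on the row index:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic subsets.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 26.0],
})
first_part = toy_df[["Survived", "Pclass"]]
end_part = toy_df[["Name", "Age"]]

# axis=0 appends rows: 3 + 3 rows, and the columns each frame lacks become NaN.
stacked = pd.concat([first_part, end_part], axis=0)

# axis=1 appends columns: rows are matched on the index, so nothing is lost.
side_by_side = pd.concat([first_part, end_part], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

If the goal is to rebuild a wider table from column subsets of the same frame, `axis=1` is usually the one to reach for.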

