### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the core Pandas syntax for selecting and querying data.

The lab is broken down into three parts, and will be completed throughout class.

1. Basic selectors with Pandas
2. Selecting based on conditions & boolean indexes
3. Special commands for selecting certain types of rows

### Section I: Selecting Data With Pandas

**1). What is the average age of all passengers on board?**

In [76]:
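import numpy as np
import pandas as pd
#load the Titanic data (adjust the path to wherever titanic.csv lives on your machine)
df = pd.read_csv(r"/Users/emilylam/Desktop/DAT/repo121/Lectures/Unit2/data/titanic.csv")
#average age of all passengers; mean() skips missing ages by default
df['Age'].mean()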

**2). What are the median values of the Fare & SibSp columns?**

In [14]:
median_sibsp = df['SibSp'].median()
median_fare = df['Fare'].median()
print(median_sibsp)
print(median_fare)

0.0
14.4542


**3). What was the maximum fare paid among the first 100 passengers on board? (This would be the first 100 rows)**

In [42]:
df['Fare'][:100].max()


263.0

**4). What is the modal value of the last 4 columns in the dataset?**

In [36]:
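#mode() returns every tied value, so a column can have several modes
#columns with fewer modes than the longest one are padded with NaN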
df.iloc[: , -4:].mode()

Unnamed: 0,Ticket,Fare,Cabin,Embarked
0,1601,8.05,B96 B98,S
1,347082,,C23 C25 C27,
2,CA. 2343,,G6,
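The NaN padding in the output above is easier to see on a tiny frame. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the Titanic file): one column has a two-way tie for its mode, the other has a single mode.

```python
import pandas as pd

# Hypothetical toy data, not taken from titanic.csv
toy = pd.DataFrame({
    "Cabin":    ["G6", "G6", "B96", "B96"],   # two values tie for the mode
    "Embarked": ["S", "S", "C", "Q"],         # a single mode
})

modes = toy.mode()   # one row per tied mode; shorter columns are padded with NaN
print(modes)
```

The result has as many rows as the column with the most tied modes, which is why the lab output shows three rows for `Cabin` but only one value for `Fare` and `Embarked`.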


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [37]:
df.iloc[:250 , :3].mean()


PassengerId    125.500
Survived         0.344
Pclass           2.416
dtype: float64
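The positional slices used in this section can be tried on a small frame where the answers are easy to check by hand. The sketch below assumes a made-up `toy` frame; only the `iloc` slicing patterns mirror the lab.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic file
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": ["x", "y", "y", "z"],
})

first_two_max = toy["b"].iloc[:2].max()   # first two rows of one column
last_two_cols = toy.iloc[:, -2:]          # every row, last two columns
block_mean = toy.iloc[:3, :2].mean()      # first 3 rows of the first 2 columns

print(first_two_max)
print(list(last_two_cols.columns))
print(block_mean)
```

In `iloc[rows, columns]` both slices are positional and exclude the end point, so `:3` means positions 0, 1 and 2.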

### Section II: Selecting Based on Conditions

**1). How many females were on board the Titanic? Men?**

In [91]:
women = df[df['Sex']=='female'].shape[0]
men = df[df['Sex']=='male'].shape[0]

print("women:", women)
print("men:",men)

women: 314
men: 577
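Filtering and reading `.shape[0]` is one way to count; summing the boolean mask directly is another. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Sex": ["female", "male", "male", "female", "male"]})

is_female = toy["Sex"] == "female"          # boolean Series, one entry per row

count_by_shape = toy[is_female].shape[0]    # filter first, then count rows
count_by_sum = int(is_female.sum())         # True counts as 1, False as 0
count_men = int((~is_female).sum())         # ~ flips every True/False

print(count_by_shape, count_by_sum, count_men)
```

Both approaches give the same count; summing the mask skips building the filtered DataFrame.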


**2). What was the survival rate for females on the Titanic? Men?**

In [132]:
#filter the rows by sex, select the 'Survived' column, then take its mean
#since Survived is 0 or 1, the mean is the survival rate
women_survival= df[df.Sex == 'female']['Survived'].mean()
men_survival= df[df.Sex == 'male']['Survived'].mean()

print("women",women_survival)
print("men",men_survival)

women 0.7420382165605095
men 0.18890814558058924


**3). What was the survival rate for people in either passenger class 1 or 2?**

In [130]:
#this filters the dataframe to rows where Pclass is 1 or 2, then takes the mean of the 'Survived' column
#since Survived is 0 or 1, the mean is the survival rate of the filtered rows
df[(df['Pclass']==1)|(df['Pclass']==2)]['Survived'].mean()

0.5575
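When one column is compared against several allowed values, `Series.isin()` is a possible alternative to chaining `==` tests with `|`. The sketch below uses made-up data (the `toy` frame is an assumption) and checks that both masks agree.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass":   [1, 2, 3, 3, 1, 2],
    "Survived": [1, 0, 0, 1, 1, 1],
})

mask_or = (toy["Pclass"] == 1) | (toy["Pclass"] == 2)   # explicit OR of two tests
mask_isin = toy["Pclass"].isin([1, 2])                  # same rows, one call

rate_or = toy.loc[mask_or, "Survived"].mean()
rate_isin = toy.loc[mask_isin, "Survived"].mean()

print(rate_or, rate_isin)
```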

**4). Were women more likely to survive if they were traveling without siblings?**

In [143]:
#without siblings (note: SibSp counts siblings and spouses aboard)
withoutsiblings = df[(df['Sex']=='female')&(df['SibSp']==0)]['Survived'].mean()

#with siblings
withsiblings = df[(df['Sex']=='female')&(df['SibSp']>0)]['Survived'].mean()
print("without siblings", withoutsiblings)
print("with siblings", withsiblings)

without siblings 0.7873563218390804
with siblings 0.6857142857142857
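Combining conditions is where parentheses matter: `&` and `|` bind more tightly than `==` and `>`, so each comparison needs its own pair. A minimal sketch on made-up data (the `toy` frame and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "female", "male"],
    "SibSp":    [0, 1, 0, 0, 2],
    "Survived": [1, 0, 0, 0, 1],
})

# Wrap each comparison in parentheses before combining with &
alone = (toy["Sex"] == "female") & (toy["SibSp"] == 0)
with_sibs = (toy["Sex"] == "female") & (toy["SibSp"] > 0)

rate_alone = toy.loc[alone, "Survived"].mean()
rate_with_sibs = toy.loc[with_sibs, "Survived"].mean()

print(rate_alone, rate_with_sibs)
```

`toy.loc[mask, "Survived"]` selects rows and a column in one step, which is equivalent to `toy[mask]["Survived"]` for reading values.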


### Section III: Special Types of Selectors

To get some additional practice with common Pandas methods, we'll go over some scenarios where you typically need to select data.

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method to use.  In Jupyter, put `?` before a method name (for example `?df.isnull`) to read its documentation and figure out how to use it.

This section is designed as a bit of a treasure hunt to familiarize you with many of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.

In [149]:
#df.isnull() returns a DataFrame of the same shape, with True wherever a cell is missing
#.sum() adds up each column, counting True as 1 and False as 0
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
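The same mask can also give the share of missing values per column, because the mean of True/False values is the fraction that are True. This is a small sketch on made-up data (the `toy` frame is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing cells
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0],
    "Cabin": [None, None, "C85"],
    "Fare":  [7.25, 71.28, 8.05],
})

null_mask = toy.isnull()        # same shape as toy, True where a value is missing
null_counts = null_mask.sum()   # missing cells per column
null_share = null_mask.mean()   # fraction of each column that is missing

print(null_counts)
print(null_share)
```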

**2). Can you find the count values for every single unique value within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** Call this on a single column (a *Series*), not on the whole *DataFrame*.

In [329]:
df['Cabin'].value_counts()


G6             4
C23 C25 C27    4
B96 B98        4
F33            3
C22 C26        3
              ..
C7             1
C101           1
D10 D12        1
A31            1
C70            1
Name: Cabin, Length: 147, dtype: int64

In [330]:
#value_counts() is called on a Series here; since pandas 1.1 df.value_counts() also exists, but it counts unique rows, not the values of one column
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64
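Note that the counts above add up to 889, not 891: `value_counts()` leaves out missing values by default. The sketch below, on a made-up Series (an assumption for illustration), shows the `dropna` and `normalize` parameters that change this behaviour.

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

counts = toy.value_counts()                      # missing values are skipped
counts_with_nan = toy.value_counts(dropna=False) # missing values get their own row
shares = toy.value_counts(normalize=True)        # proportions instead of counts

print(counts)
print(counts_with_nan)
print(shares)
```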

**3). Can you find the column with the highest number of unique values?**

To use: `df.nunique()` (or `pd.Series.nunique()` for a single column), and `sort_values()` if you want to sort the result.

In [178]:
#count the distinct non-null values in each column, largest first
df.nunique().sort_values(ascending=False)

Name           891
PassengerId    891
Ticket         681
Fare           248
Cabin          147
Age             88
Parch            7
SibSp            7
Embarked         3
Pclass           3
Sex              2
Survived         2
dtype: int64

In [336]:
#unique() lists the distinct values of an individual column (NaN included)
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [339]:
#to make the previous part a list, add .tolist() 
df['Embarked'].unique().tolist()

['S', 'C', 'Q', nan]
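The two outputs above disagree slightly: `nunique()` reports 3 for `Embarked` while `unique()` returns 4 entries. The difference is the missing value. A minimal sketch on a made-up Series (an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.Series(["S", "C", "S", None, "Q"])

n_distinct = toy.nunique()               # missing value is not counted
n_with_nan = toy.nunique(dropna=False)   # missing value counted as its own entry
distinct = toy.unique()                  # missing value is included in the result

print(n_distinct, n_with_nan, len(distinct))
```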

**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`

In [190]:
#empty_columns is a boolean Series with one True/False per column (True if the column has any nulls)
#df.loc[:, mask] keeps every row and only the columns where the mask is True
empty_columns = df.isnull().sum()>0
df.loc[:,empty_columns]

Unnamed: 0,Age,Cabin,Embarked
0,22.0,,S
1,38.0,C85,C
2,26.0,,S
3,35.0,C123,S
4,35.0,,S
...,...,...,...
886,27.0,,S
887,19.0,B42,S
888,,,S
889,26.0,C148,C
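A boolean Series indexed by column names can go in the column slot of `.loc`. The sketch below uses `.any()` as an alternative to `.sum() > 0`; the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns contain a missing value, one does not
toy = pd.DataFrame({
    "Age":   [22.0, np.nan],
    "Fare":  [7.25, 8.05],
    "Cabin": [None, "C85"],
})

has_nulls = toy.isnull().any()        # one True/False per column
cols_with_nulls = toy.loc[:, has_nulls]
names = cols_with_nulls.columns.tolist()

print(names)
```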


**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.

In [340]:
#axis=1 applies any() across the columns of each row, giving one True/False per row
#axis=0 (the default) would instead give one True/False per column
#query is True for every row that contains at least one null value

query=df.isnull().any(axis=1)

#query marks the rows with nulls, so ~query keeps only the complete rows
df[~query]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [203]:
#same result via the converse: keep rows where every value is non-null
no_null = df.notnull().all(axis=1)
df.loc[no_null]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
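For comparison, pandas also has `df.dropna()`, which is not one of the methods suggested for this question but returns the same rows by default. A small sketch on made-up data (the `toy` frame is an assumption) checks that all three routes agree:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the row at index 2 is complete
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, 54.0],
    "Cabin": [None, "C85", "C123", None],
    "Fare":  [7.25, 71.28, 53.10, 51.86],
})

complete_a = toy[~toy.isnull().any(axis=1)]   # drop rows with any null
complete_b = toy[toy.notnull().all(axis=1)]   # keep rows where all are non-null
complete_c = toy.dropna()                     # built-in shortcut

print(complete_c)
```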


**6).  Can you sort passengers according to how much they paid for a ticket?**

To use: `df.sort_values()`

In [209]:
#ascending=True (the default) lists the cheapest fares first
df.sort_values(by="Fare", ascending=True)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0000,,S
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0000,,S
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0000,,S
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S
277,278,0,2,"Parkes, Mr. Francis ""Frank""",male,,0,0,239853,0.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0000,C23 C25 C27,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0000,C23 C25 C27,S
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
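Reversing the order only takes `ascending=False`. The sketch below, on made-up data (the `toy` frame is an assumption), also shows that missing values are placed last by default in either direction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fare
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Fare": [8.05, np.nan, 512.33, 0.0],
})

cheapest_first = toy.sort_values(by="Fare")                   # ascending order
priciest_first = toy.sort_values(by="Fare", ascending=False)  # descending order

print(cheapest_first)
print(priciest_first)
```

`sort_values()` returns a new DataFrame and keeps the original index labels, which is why the lab output shows row labels out of order.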


**7). Can you sort passengers according to how much they paid for a ticket, within each port of embarkation?**  

i.e., sort the rows so that the passengers who embarked from port `C` are listed first, and then within port `C` everyone is sorted by how much they paid for a ticket.

To use: `df.sort_values()`

In [216]:
#to sort by multiple columns, put values in list 
#by=[LIST]
df.sort_values(by=['Embarked','Fare'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
378,379,0,3,"Betros, Mr. Tannous",male,20.0,0,0,2648,4.0125,,C
843,844,0,3,"Lemberopolous, Mr. Peter L",male,34.5,0,0,2683,6.4375,,C
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
203,204,0,3,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.2250,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0000,C23 C25 C27,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0000,C23 C25 C27,S
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0000,C23 C25 C27,S
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0000,B28,
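Each sort column can have its own direction by passing a list to `ascending`. This sketch uses made-up data (the `toy` frame is an assumption) to sort ports alphabetically with the most expensive fares first within each port; as in the output above, a missing port ends up last.

```python
import pandas as pd

# Hypothetical toy data with one missing port
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "C", None],
    "Fare":     [8.05, 7.22, 263.0, 4.01, 80.0],
})

# Ports A-Z, and within each port the highest fare first
by_port = toy.sort_values(by=["Embarked", "Fare"], ascending=[True, False])

print(by_port)
```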


**8). If people traveled in a group, they had the same ticket number.  Can you query your dataframe to return the ticket values that occurred more than once?  I.e., run a line in pandas that returns *a list* of ticket values that occurred more than once, not an entire dataframe.**

To use: there are a few methods you can use, but try `df.duplicated()`, along with `pd.Series.unique()`.  **Hint:** You can test for duplicated values on specific columns.

In [345]:
query = df.duplicated(subset='Ticket')
uniquetickets=df[query]['Ticket'].unique()

uniquetickets

array(['349909', 'CA 2144', '19950', '11668', '347082', 'S.O.C. 14879',
       '237736', '35281', '2651', '113803', 'W./C. 6608', '3101295',
       '347088', '1601', '382652', '347742', 'CA. 2343', '347077',
       '230080', 'PC 17569', '3101278', '4133', '36973', '2665', '347054',
       'LINE', 'PC 17558', '113781', '244367', '248738', 'PC 17760',
       '363291', '367226', 'PC 17582', '345764', '113776', '16966',
       '349237', '113505', '370365', 'PC 17604', '113789', '35273',
       'PP 9549', 'STON/O2. 3101279', '19928', '239853', '370129',
       '113760', '29106', 'F.C.C. 13529', '250644', 'C.A. 34651', '2666',
       '110465', '11967', '19943', 'C.A. 37671', '2627', '110152',
       'PC 17758', '371110', '111361', '26360', '2668', 'PC 17761',
       '2908', 'C.A. 33112', '17421', 'PC 17757', '110413', '13507',
       '28403', '36947', '345773', 'PC 17485', '243847', 'SC/Paris 2123',
       '367230', '347080', 'A/5. 3336', '230136', '2653', '13502',
       'C.A. 31921', '3765

In [300]:
ticket_dup= df[df['Ticket'].duplicated()]
unique_dupes=ticket_dup.Ticket.unique()
unique_dupes

array(['349909', 'CA 2144', '19950', '11668', '347082', 'S.O.C. 14879',
       '237736', '35281', '2651', '113803', 'W./C. 6608', '3101295',
       '347088', '1601', '382652', '347742', 'CA. 2343', '347077',
       '230080', 'PC 17569', '3101278', '4133', '36973', '2665', '347054',
       'LINE', 'PC 17558', '113781', '244367', '248738', 'PC 17760',
       '363291', '367226', 'PC 17582', '345764', '113776', '16966',
       '349237', '113505', '370365', 'PC 17604', '113789', '35273',
       'PP 9549', 'STON/O2. 3101279', '19928', '239853', '370129',
       '113760', '29106', 'F.C.C. 13529', '250644', 'C.A. 34651', '2666',
       '110465', '11967', '19943', 'C.A. 37671', '2627', '110152',
       'PC 17758', '371110', '111361', '26360', '2668', 'PC 17761',
       '2908', 'C.A. 33112', '17421', 'PC 17757', '110413', '13507',
       '28403', '36947', '345773', 'PC 17485', '243847', 'SC/Paris 2123',
       '367230', '347080', 'A/5. 3336', '230136', '2653', '13502',
       'C.A. 31921', '3765

**9). See if you can query a dataframe so that it only returns rows with passengers that are traveling in groups, based on their ticket numbers.**

To use: `df.isin()`, assuming you used the approach suggested in the previous question.

In [346]:
#checks whether each row's 'Ticket' value appears in unique_dupes (the tickets that occur more than once)
listof = df['Ticket'].isin(unique_dupes)
df[listof].sort_values('Ticket')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.500,B77,S
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.500,B77,S
504,505,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.500,B79,S
262,263,0,1,"Taussig, Mr. Emil",male,52.0,1,1,110413,79.650,E67,S
558,559,1,1,"Taussig, Mrs. Emil (Tillie Mandelbaum)",female,39.0,1,1,110413,79.650,E67,S
...,...,...,...,...,...,...,...,...,...,...,...,...
736,737,0,3,"Ford, Mrs. Edward (Margaret Ann Watson)",female,48.0,1,3,W./C. 6608,34.375,,S
86,87,0,3,"Ford, Mr. William Neal",male,16.0,1,3,W./C. 6608,34.375,,S
147,148,0,3,"Ford, Miss. Robina Maggie ""Ruby""",female,9.0,2,2,W./C. 6608,34.375,,S
540,541,1,1,"Crosby, Miss. Harriet R",female,36.0,0,2,WE/P 5735,71.000,B22,S
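The two-step approach (find repeated tickets, then `isin()`) can be compared with `duplicated(keep=False)`, which marks every member of a repeated group in one call. A minimal sketch on made-up data (the `toy` frame and its names are assumptions):

```python
import pandas as pd

# Hypothetical toy data: tickets "19950" and "LINE" are shared
toy = pd.DataFrame({
    "Name":   ["A", "B", "C", "D", "E"],
    "Ticket": ["19950", "LINE", "19950", "113803", "LINE"],
})

# Default keep="first" flags only the second and later occurrences
later_copies = toy["Ticket"].duplicated()
shared_tickets = list(toy.loc[later_copies, "Ticket"].unique())

groups_isin = toy[toy["Ticket"].isin(shared_tickets)]          # two-step approach
groups_keep_false = toy[toy["Ticket"].duplicated(keep=False)]  # one-step approach

print(shared_tickets)
print(groups_keep_false)
```

With the default `keep="first"`, the first passenger on each shared ticket is not flagged, which is why the lab needs the extra `isin()` step to recover whole groups.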


**10).  Can you only select columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [348]:
#pandas data types are built on NumPy data types
#select_dtypes needs an include (or exclude) argument
#include='object' selects the text-based columns; the older np.object alias was removed in NumPy 1.24
#include=np.number selects every numeric data type

df.select_dtypes(include='object')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S
...,...,...,...,...,...
886,"Montvila, Rev. Juozas",male,211536,,S
887,"Graham, Miss. Margaret Edith",female,112053,B42,S
888,"Johnston, Miss. Catherine Helen ""Carrie""",female,W./C. 6607,,S
889,"Behr, Mr. Karl Howell",male,111369,C148,C


**11).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [351]:
df.select_dtypes(include=np.number)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000
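`select_dtypes()` also accepts an `exclude` argument and the string alias `"number"`, so the numeric and non-numeric halves of a frame can be split without importing NumPy. A small sketch on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "Name":   ["A", "B"],
    "Age":    [22.0, 38.0],
    "Pclass": [3, 1],
})

numeric = toy.select_dtypes(include="number")      # ints and floats
non_numeric = toy.select_dtypes(exclude="number")  # everything else

print(numeric.columns.tolist())
print(non_numeric.columns.tolist())
```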


In [383]:
#adding .columns attribute will return an index list of column names.
#adding .tolist() will make it a regular list and clean it up 

columns = df.select_dtypes(include=np.number).columns
columnslist = df.select_dtypes(include=np.number).columns.tolist()

print(columns)
print(columnslist)

Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']


In [384]:
df[columnslist]

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000


**12). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 11.

In [385]:
#fills each column null with the column's average
df[columnslist]=df[columnslist].fillna(df[columnslist].mean())
df[columnslist]


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.000000,1,0,7.2500
1,2,1,1,38.000000,1,0,71.2833
2,3,1,3,26.000000,0,0,7.9250
3,4,1,1,35.000000,1,0,53.1000
4,5,0,3,35.000000,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.000000,0,0,13.0000
887,888,1,1,19.000000,0,0,30.0000
888,889,0,3,29.699118,1,2,23.4500
889,890,1,1,26.000000,0,0,30.0000


In [386]:
df[columnslist].isnull().sum() #checks for null values, should be all 0

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64
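Passing a Series of column means to `fillna()` fills each column with its own mean, because pandas matches the Series index against the column names. The sketch below uses made-up data (the `toy` frame is an assumption) and works on a copy so the original stays untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both numeric columns
toy = pd.DataFrame({
    "Age":  [20.0, np.nan, 40.0],
    "Fare": [10.0, 20.0, np.nan],
    "Name": ["A", "B", "C"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
col_means = toy[numeric_cols].mean()   # one mean per numeric column, NaN skipped

filled = toy.copy()                    # leave the original frame unchanged
filled[numeric_cols] = filled[numeric_cols].fillna(col_means)

print(filled)
```

The lab assigns the result back into `df`, which overwrites the missing values permanently; copying first makes it possible to compare the data before and after.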