### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the Pandas syntax for selecting data.

The lab is broken down into three parts and will be completed over the course of the class.

 1. Basic selectors with Pandas
 2. Selecting based on conditions & boolean indexes
 3. Special commands for selecting certain types of rows

### Section I: Selecting Data With Pandas

**1). What is the average age of all passengers on board?**

In [2]:
import pandas as pd
df = pd.read_csv("/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit2/data/titanic.csv")

In [3]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
df.loc[:,'Age'].mean()

29.69911764705882

In [5]:
df['Age'].mean()

29.69911764705882

**2). What are the median values of the Fare & SibSp columns?**

In [6]:
df.loc[:,['Fare', 'SibSp']].median()

Fare     14.4542
SibSp     0.0000
dtype: float64

In [7]:
df[['Fare', 'SibSp']].median()
#Simpler

Fare     14.4542
SibSp     0.0000
dtype: float64

**3). What was the maximum fare paid among the first 100 passengers on board? (This would be the first 100 rows)**

In [8]:
df.loc[:99, 'Fare'].max()
# .loc slices by label and includes the end label, so :99 covers rows 0 through 99 (100 rows)

263.0

In [9]:
df['Fare'][:100].max()

263.0
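
The two cells above slice in different ways: `.loc` slices by label and includes the end label, while `.iloc` and a plain `[:n]` slice by position and stop before the end. A minimal sketch on a small made-up frame (the name `toy` and its values are illustrative and are not part of the Titanic data):

```python
import pandas as pd

# Hypothetical toy data with the default 0..4 integer index
toy = pd.DataFrame({"Fare": [7.25, 71.28, 7.93, 53.10, 8.05]})

loc_rows = toy.loc[:3, "Fare"]     # label slice: includes label 3
iloc_rows = toy.iloc[:3]["Fare"]   # positional slice: stops before position 3
plain_rows = toy["Fare"][:3]       # a plain slice on a Series is positional too

print(len(loc_rows), len(iloc_rows), len(plain_rows))  # 4 3 3
```

With the default integer index, "the first n rows" is therefore `.loc[:n-1]` or `.iloc[:n]`.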

**4). What is the modal value of the last 4 columns in the dataset?**

In [10]:
df.iloc[:, -4:].mode()
# Ticket and Cabin each have a three-way tie for the mode, so they fill three rows.
# Fare and Embarked have a single mode, so their remaining rows are padded with NaN.

Unnamed: 0,Ticket,Fare,Cabin,Embarked
0,1601,8.05,B96 B98,S
1,347082,,C23 C25 C27,
2,CA. 2343,,G6,


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [11]:
df.iloc[:250, :3].mean()

PassengerId    125.500
Survived         0.344
Pclass           2.416
dtype: float64

### Section II: Selecting Based on Conditions

**1). How many women were on board the Titanic? How many men?**

In [12]:
df[df['Sex'] == 'female'].shape[0]
# .shape returns (rows, columns); [0] picks out the row count

314

In [13]:
df[df['Sex'] == 'male'].shape[0]

577

**2). What was the survival rate for women on the Titanic? For men?**

In [14]:
df['Survived'].mean()
# The value 1 occurred about 38% of the time.
# The mean of a 0/1 column is the share of 1s, which is handy for binary columns.

0.3838383838383838

In [15]:
df[df['Sex'] == 'female']['Survived'].mean()

0.7420382165605095

In [16]:
df[df['Sex'] == 'male']['Survived'].mean()

0.18890814558058924

**3). What was the survival rate for people in either passenger class 1 or 2?**

In [17]:
df[((df['Pclass'] == 1) | (df['Pclass'] == 2))]['Survived'].mean()

0.5575

**4). Were women more likely to survive if they were traveling without siblings?**

In [18]:
df[(df['Sex'] == 'female') & (df['SibSp'] == 0)]['Survived'].mean()
# Yes: about 78.7% survived, compared with about 74.2% of all women

0.7873563218390804
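
The queries in this section combine conditions with `|` (or) and `&` (and). A short sketch of the pattern on a made-up frame (`toy` and its values are hypothetical), including `isin()` as another way to write an "either value" condition:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 3, 2, 1],
    "Survived": [1, 0, 1, 0],
})

# Each comparison needs its own parentheses: & and | bind more tightly than ==
either_class = toy[(toy["Pclass"] == 1) | (toy["Pclass"] == 2)]

# isin() expresses the same "1 or 2" condition
same_rows = toy[toy["Pclass"].isin([1, 2])]

# & keeps only the rows where both conditions hold
first_class_women = toy[(toy["Sex"] == "female") & (toy["Pclass"] == 1)]

print(either_class.equals(same_rows))  # True
print(len(first_class_women))          # 1
```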

### Section III: Special Types of Selectors

To get some additional practice with common Pandas methods, we'll work through scenarios that come up often when selecting data.

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method to use.  Put `?` before the method (for example, `?df.duplicated`) to read how it works and figure out how to use it.

This section is designed as a bit of a treasure hunt, so that you get familiar with many of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.

In [22]:
df.isnull().sum()
# Shows you how many empty values you have in each column

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

**2). Can you find the count values for every single unique value within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** This is a *Series* method, not a *DataFrame* method.

In [28]:
df['Embarked'].value_counts()
# Shows you the count of unique values

S    644
C    168
Q     77
Name: Embarked, dtype: int64
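
The counts above add up to 889 rather than 891, because `value_counts()` skips missing values by default. A small sketch of the relevant options on a made-up Series (`ports` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy Series with one missing entry
ports = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

counts = ports.value_counts()                     # missing values are skipped
with_missing = ports.value_counts(dropna=False)   # missing values get their own row
shares = ports.value_counts(normalize=True)       # proportions of the non-missing values

print(counts.sum(), with_missing.sum())  # 5 6
print(shares["S"])                       # 0.6
```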

**3). Can you find the column with the highest number of unique values?**

To use: `pd.Series.nunique`, and `df.sort_values()` if you want to sort it.

In [None]:
# your answer here
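
One possible approach, sketched on a small made-up frame rather than the Titanic data so the exercise stays open. Calling `nunique()` on a DataFrame returns one count per column as a Series, which can then be sorted; `toy`, `unique_counts`, `ranked`, and `top_column` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 3, 3, 1],
})

unique_counts = toy.nunique()                         # unique values per column
ranked = unique_counts.sort_values(ascending=False)   # largest count first
top_column = unique_counts.idxmax()                   # label of the largest count

print(ranked)
print(top_column)  # Name
```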

**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`

In [37]:
df.isnull().sum() > 0

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [38]:
type(df.isnull().sum() > 0)
# This output is a new series
# An index can be basically anything

pandas.core.series.Series

In [40]:
(df.isnull().sum() > 0).index
# These are the index values
# In this case they're the column labels of df
# So we can save the boolean Series as a variable and pass it to .loc as a column selector

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [42]:
query = df.isnull().sum() > 0
# Save this condition as a variable

In [43]:
df.loc[:, query]
# Then search off of this
# Only returning the columns where the boolean returned True

Unnamed: 0,Age,Cabin,Embarked
0,22.0,,S
1,38.0,C85,C
2,26.0,,S
3,35.0,C123,S
4,35.0,,S
...,...,...,...
886,27.0,,S
887,19.0,B42,S
888,,,S
889,26.0,C148,C


**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.

In [48]:
df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [49]:
the_list = [True, False, True, True, False]

In [51]:
any(the_list)
# If any of the things return true, the whole thing is true

True

In [53]:
all(the_list)
# All of these must be true in order for this to return true

False

In [56]:
df.isnull().any()
# Tells us if any of the columns have empty values
# We want to know if any of the rows have empty values

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [58]:
df.isnull().any(axis = 1)
# For every single row, if there's at least one null value, this will return true

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Length: 891, dtype: bool

In [59]:
query = df.isnull().any(axis = 1)

In [62]:
df[query]
# Good. But we want the inverse of this.
# Right now this is showing rows with a null value.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S


In [69]:
df[~query]
# Where there are NO null values.
# The tilde (~) flips every True/False value in a boolean Series.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
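
The hint for this question lists two routes: `isnull()` with `any()`, or `notnull()` with `all()`. The sketch below checks on a made-up frame that both give the same rows, and that `dropna()` (a pandas method not named in the question, shown here only for comparison) matches them as well. The names `toy`, `has_null`, and `complete_*` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the first row is complete
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": ["C85", "C123", None],
    "Fare": [7.25, 53.1, 8.05],
})

has_null = toy.isnull().any(axis=1)            # True where a row has any null

complete_a = toy[~has_null]                    # invert the "has a null" mask
complete_b = toy[toy.notnull().all(axis=1)]    # keep rows where every value is present
complete_c = toy.dropna()                      # default: drop rows with any null

print(complete_a.equals(complete_b), complete_a.equals(complete_c))  # True True
```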


**6).  Can you sort passengers according to how much they paid for a ticket?**

To use: `df.sort_values()`

In [67]:
df.sort_values(['Fare'], ascending = False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0000,C23 C25 C27,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
...,...,...,...,...,...,...,...,...,...,...,...,...
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S
413,414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0000,,S
822,823,0,1,"Reuchlin, Jonkheer. John George",male,38.0,0,0,19972,0.0000,,S
732,733,0,2,"Knight, Mr. Robert J",male,,0,0,239855,0.0000,,S


**7). Can you sort passengers according to how much they paid for a ticket, within each port of embarkation?**

i.e., sort the rows so that the passengers who embarked from port `C` are listed first, and then within port `C` everyone is sorted by how much they paid for a ticket.

To use: `df.sort_values()`

In [68]:
df.sort_values(['Embarked', 'Fare'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
378,379,0,3,"Betros, Mr. Tannous",male,20.0,0,0,2648,4.0125,,C
843,844,0,3,"Lemberopolous, Mr. Peter L",male,34.5,0,0,2683,6.4375,,C
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
203,204,0,3,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.2250,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0000,C23 C25 C27,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0000,C23 C25 C27,S
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0000,C23 C25 C27,S
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0000,B28,
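
The last row of the output above has no `Embarked` value: `sort_values()` places missing values at the end by default. A brief sketch of multi-column sorting on made-up data (`toy` and the result names are hypothetical), showing a separate direction per column and the `na_position` option:

```python
import pandas as pd

# Hypothetical toy data with one missing port
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "C", "S"],
    "Fare": [8.05, 71.28, 80.0, 7.23, 263.0],
})

# Port first, then fare, both ascending; the missing port goes last
by_port_then_fare = toy.sort_values(["Embarked", "Fare"])

# One direction per column: ports A to Z, fares high to low within each port
priciest_first = toy.sort_values(["Embarked", "Fare"], ascending=[True, False])

# Put the rows with a missing port first instead of last
missing_first = toy.sort_values(["Embarked", "Fare"], na_position="first")

print(list(by_port_then_fare.index))  # [3, 1, 0, 4, 2]
print(list(priciest_first.index))     # [1, 3, 4, 0, 2]
print(list(missing_first.index))      # [2, 3, 1, 0, 4]
```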


**8). If people traveled in a group they had the same ticket number.  Can you query your dataframe to return the ticket values that occurred more than once?  I.e., run a line in pandas that returns *a list* of ticket values that occurred more than once, not an entire dataframe.**

To use: there are a few methods you can use, but try `df.duplicated()`, along with `pd.Series.unique()`.  **Hint:** You can test for duplicated values on specific columns, and `unique()` is a *Series* method.

In [71]:
df.duplicated()
# The default value for this checks for duplicate values in every single column
# Checking if every row matches on every single value

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

In [72]:
?df.duplicated

In [74]:
df.duplicated(subset=['Fare','Sex'])
# If there are any matches on these two columns this will show up as a duplicate value

0      False
1      False
2      False
3      False
4      False
       ...  
886     True
887     True
888    False
889     True
890     True
Length: 891, dtype: bool

In [77]:
query = df.duplicated(subset=['Fare','Sex'])
df[query]
# Stuff with duplicates

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S
37,38,0,3,"Cann, Mr. Ernest Charles",male,21.0,0,0,A./5. 2152,8.0500,,S
38,39,0,3,"Vander Planke, Miss. Augusta Maria",female,18.0,2,0,345764,18.0000,,S
42,43,0,3,"Kraeff, Mr. Theodor",male,,0,0,349253,7.8958,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [78]:
query = df.duplicated(subset=['Fare','Sex'])
df[~query]
# Stuff without duplicates

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
876,877,0,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q


In [80]:
?df.duplicated

In [88]:
query = df.duplicated(subset='Ticket')
# We're going to miss the first person in every group

In [89]:
df[query]['Ticket']
# Querying to get the unique values of tickets


24         349909
71        CA 2144
88          19950
117         11668
119        347082
          ...    
876          7534
879         11767
880        230433
885        382652
888    W./C. 6607
Name: Ticket, Length: 210, dtype: object

In [90]:
unique_tickets = df[query]['Ticket'].unique()

In [91]:
unique_tickets
# Gives you a list of all the tickets that occurred more than once

array(['349909', 'CA 2144', '19950', '11668', '347082', 'S.O.C. 14879',
       '237736', '35281', '2651', '113803', 'W./C. 6608', '3101295',
       '347088', '1601', '382652', '347742', 'CA. 2343', '347077',
       '230080', 'PC 17569', '3101278', '4133', '36973', '2665', '347054',
       'LINE', 'PC 17558', '113781', '244367', '248738', 'PC 17760',
       '363291', '367226', 'PC 17582', '345764', '113776', '16966',
       '349237', '113505', '370365', 'PC 17604', '113789', '35273',
       'PP 9549', 'STON/O2. 3101279', '19928', '239853', '370129',
       '113760', '29106', 'F.C.C. 13529', '250644', 'C.A. 34651', '2666',
       '110465', '11967', '19943', 'C.A. 37671', '2627', '110152',
       'PC 17758', '371110', '111361', '26360', '2668', 'PC 17761',
       '2908', 'C.A. 33112', '17421', 'PC 17757', '110413', '13507',
       '28403', '36947', '345773', 'PC 17485', '243847', 'SC/Paris 2123',
       '367230', '347080', 'A/5. 3336', '230136', '2653', '13502',
       'C.A. 31921', '3765

**9). See if you can query a dataframe so that it only returns rows with passengers that are traveling in groups, based on their ticket numbers.**

To use: `df.isin()`, assuming you used the approach suggested in the previous question.

In [97]:
df['Ticket'].isin(unique_tickets)
# Sometimes you want to check for membership of certain values

0      False
1      False
2      False
3       True
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Ticket, Length: 891, dtype: bool

In [98]:
partofagroup = df['Ticket'].isin(unique_tickets)

In [99]:
df[partofagroup]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
...,...,...,...,...,...,...,...,...,...,...,...,...
876,877,0,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
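
By default `duplicated()` leaves the first member of each group unmarked, which is why the approach above needs `isin()` as a second step. As a hedged alternative, `duplicated()` also accepts `keep=False`, which marks every member of a repeated group. A sketch on made-up data (`toy` and the other names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: tickets "111" and "222" are shared
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Ticket": ["111", "222", "111", "333", "222"],
})

repeats_only = toy["Ticket"].duplicated()            # keep='first': first holder is False
whole_group = toy["Ticket"].duplicated(keep=False)   # every holder of a shared ticket is True

# Two-step route used in the lab: list the shared tickets, then test membership
shared_tickets = toy.loc[repeats_only, "Ticket"].unique()
via_isin = toy[toy["Ticket"].isin(shared_tickets)]

# One-step route
via_keep_false = toy[whole_group]

print(repeats_only.sum(), whole_group.sum())  # 2 4
print(via_isin.equals(via_keep_false))        # True
```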


**10). Can you select only the columns that contain text?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [104]:
# Select only the columns that hold text
import numpy as np
df.select_dtypes(include='object')
# The dtype you want to select goes after include=
# 'object' is the dtype pandas reports for the text columns here (see df.dtypes)
# np.object was only an alias for Python's object and has been removed from recent NumPy releases

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S
...,...,...,...,...,...
886,"Montvila, Rev. Juozas",male,211536,,S
887,"Graham, Miss. Margaret Edith",female,112053,B42,S
888,"Johnston, Miss. Catherine Helen ""Carrie""",female,W./C. 6607,,S
889,"Behr, Mr. Karl Howell",male,111369,C148,C


In [106]:
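df.select_dtypes(include='float64')
# The same pattern works for other dtypes, e.g. 'int64', 'number', or 'object'
# np.float has been removed from recent NumPy releases, so pass the dtype name as a string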

Unnamed: 0,Age,Fare
0,22.0,7.2500
1,38.0,71.2833
2,26.0,7.9250
3,35.0,53.1000
4,35.0,8.0500
...,...,...
886,27.0,13.0000
887,19.0,30.0000
888,,23.4500
889,26.0,30.0000


In [101]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

**11).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [107]:
df.select_dtypes(include=np.number)
# Just select every single numeric data type
# You could save this as a variable to reuse it

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000


In [109]:
df.select_dtypes(include=np.number).columns.tolist()
# List of all the numeric columns

['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [110]:
num_cols = df.select_dtypes(include=np.number).columns.tolist()
# Turn it into a variable

In [111]:
df[num_cols]
# If you want to reuse the columns over and over again
# If you want to see multiple values you're basically just passing through a list
# df[['Embarked', 'Fare']] is the same thing as using a variable that holds a list of those labels

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000
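
`select_dtypes()` also accepts string aliases such as `'number'`, and an `exclude` argument for the opposite selection. A minimal sketch on a made-up frame (`toy` and the list names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one text column, one float column, one integer column
toy = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()   # ints and floats
other_cols = toy.select_dtypes(exclude="number").columns.tolist()     # everything else
float_cols = toy.select_dtypes(include="float64").columns.tolist()    # floats only

print(numeric_cols)  # ['Age', 'Pclass']
print(other_cols)    # ['Name']
print(float_cols)    # ['Age']
```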


**12). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 11.

In [112]:
df[num_cols]

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000
887,888,1,1,19.0,0,0,30.0000
888,889,0,3,,1,2,23.4500
889,890,1,1,26.0,0,0,30.0000


In [113]:
# Fill all of the missing values with the average value
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

In [114]:
df[num_cols]

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.000000,1,0,7.2500
1,2,1,1,38.000000,1,0,71.2833
2,3,1,3,26.000000,0,0,7.9250
3,4,1,1,35.000000,1,0,53.1000
4,5,0,3,35.000000,0,0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,27.000000,0,0,13.0000
887,888,1,1,19.000000,0,0,30.0000
888,889,0,3,29.699118,1,2,23.4500
889,890,1,1,26.000000,0,0,30.0000


In [115]:
df[num_cols].isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64

In [None]:
# Yay! No null values
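
The fill step works because `df[num_cols].mean()` returns a Series indexed by column name, and `fillna()` matches those labels to columns, so each column is filled with its own mean. A small sketch on made-up data (`toy` and `col_means` are hypothetical names; `num_cols` mirrors the variable used above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two numeric columns and one text column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
    "Name": ["Ann", None, "Cat"],
})

num_cols = toy.select_dtypes(include="number").columns.tolist()
col_means = toy[num_cols].mean()   # one mean per column; NaN is skipped by default

# Each numeric column is filled with its own mean; the text column is untouched
toy[num_cols] = toy[num_cols].fillna(col_means)

print(toy["Age"].tolist())          # [20.0, 30.0, 40.0]
print(toy["Fare"].tolist())         # [10.0, 30.0, 20.0]
print(toy["Name"].isnull().sum())   # 1
```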