### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the Pandas syntax for selecting and filtering data.

The lab is broken into three parts, which will be completed over the course of the class.

 1. Basic selectors with Pandas
 2. Selecting based on conditions & boolean indexes
 3. Special commands for selecting certain types of rows

### Section I: Selecting Data With Pandas

**1). What is the average age of all passengers on board?**

In [3]:
import pandas as pd
df = pd.read_csv(r'C:\Users\iulia\OneDrive\Documents\Data Science\titanic.csv')
df['Age'].mean()

29.69911764705882

**2). What are the median values of the Fare & SibSp columns?**

In [4]:
df[['Fare','SibSp']].median()

Fare     14.4542
SibSp     0.0000
dtype: float64

**3). What was the maximum fare paid among the first 100 passengers on board? (This would be the first 100 rows)**

In [5]:
df['Fare'][:100].max()

263.0

**4). What are the modal values of the last 4 columns in the dataset?**

In [6]:
df.iloc[:, -4:].mode()

Unnamed: 0,Ticket,Fare,Cabin,Embarked
0,1601,8.05,B96 B98,S
1,347082,,C23 C25 C27,
2,CA. 2343,,G6,


**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [7]:
df.iloc[:250, :3].mean()

PassengerId    125.500
Survived         0.344
Pclass           2.416
dtype: float64

### Section II: Selecting Based on Conditions

**1). How many women were on board the Titanic? How many men?**

In [8]:
print(df[df.Sex == 'female'].shape[0])
print(df[df.Sex == 'male'].shape[0])

314
577


**2). What was the survival rate for women on the Titanic? For men?**

In [9]:
print(df[df.Sex == 'female']['Survived'].mean())
print(df[df.Sex == 'male']['Survived'].mean())

0.7420382165605095
0.18890814558058924


**3). What was the survival rate for people in either passenger class 1 or 2?**

In [10]:
print(df[(df.Pclass == 1) | (df.Pclass == 2)]['Survived'].mean())

0.5575


**4). Were women more likely to survive if they were traveling without siblings?**

In [11]:
query = (df.Sex == 'female') & (df.SibSp == 0)
df[query]['Survived'].mean()

0.7873563218390804
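
The question is a comparison, so the rate above needs a second number to compare against, such as the rate for women with `SibSp` greater than 0. A minimal sketch of that comparison, run on an invented six-row table (the `toy` frame and its values are made up for illustration and say nothing about the real passengers):

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male"],
    "SibSp": [0, 0, 1, 2, 0, 1],
    "Survived": [1, 1, 1, 0, 0, 1],
})

women = toy[toy.Sex == "female"]

# Mean of a 0/1 column is the share of 1s, i.e. the survival rate
rate_alone = women[women.SibSp == 0]["Survived"].mean()
rate_with_siblings = women[women.SibSp > 0]["Survived"].mean()

print(rate_alone, rate_with_siblings)
```

Running the same two lines against `df` gives the pair of rates needed to answer the question.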

### Section III: Special Types of Selectors

For additional practice with common Pandas methods, we'll work through scenarios that come up frequently when selecting data.

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method to use.  Put `?` before the method name to read its documentation and figure out how to use it.

This section is designed as a bit of a treasure hunt to familiarize you with many of the bread-and-butter Pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.

In [18]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

**2). Can you count how many times each unique value occurs within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** This is a *Series* method, not a *DataFrame* method.  

In [23]:
?pd.Series.value_counts

In [29]:
df['Fare'].value_counts()

8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
8.4583      1
9.8375      1
8.3625      1
14.1083     1
17.4000     1
Name: Fare, Length: 248, dtype: int64

In [27]:
pd.Series.value_counts(df['Fare'])

8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
8.4583      1
9.8375      1
8.3625      1
14.1083     1
17.4000     1
Name: Fare, Length: 248, dtype: int64

**3). Can you find the column with the highest number of unique values?**

To use: `df.nunique()`, which returns a Series, and `pd.Series.sort_values()` if you want to sort it.

In [55]:
df.nunique().sort_values(ascending = True)

Survived         2
Sex              2
Pclass           3
Embarked         3
SibSp            7
Parch            7
Age             88
Cabin          147
Fare           248
Ticket         681
PassengerId    891
Name           891
dtype: int64
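
If only the name of the top column is wanted, rather than the full sorted listing, `idxmax()` on the result of `nunique()` is one option. A small sketch on an invented table (the `toy` frame is hypothetical); note that when several columns tie for the maximum, `idxmax()` returns only the first of them:

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Pclass": [1, 1, 3],
    "PassengerId": [1, 2, 3],
})

unique_counts = toy.nunique()        # Series: one count per column
top_column = unique_counts.idxmax()  # label of the largest count

print(top_column)
```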

**4). Can you query your dataframe so that it returns only the columns that contain missing values?**

To use: `df.isnull()`, `df.loc`

In [64]:
df.loc[:,df.isnull().sum()>0]

Unnamed: 0,Age,Cabin,Embarked
0,22.0,,S
1,38.0,C85,C
2,26.0,,S
3,35.0,C123,S
4,35.0,,S
...,...,...,...
886,27.0,,S
887,19.0,B42,S
888,,,S
889,26.0,C148,C


**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.
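
The two routes suggested above should select the same rows. A minimal sketch on an invented three-row table (the `toy` frame is hypothetical) that builds the mask both ways and checks that they agree:

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Cabin": [None, "C85", "C123"],
})

# True for rows with at least one missing value
has_missing = toy.isnull().any(axis=1)

# Route 1: invert the "has a missing value" mask with ~
complete_a = toy[~has_missing]

# Route 2: keep rows where every value is present
complete_b = toy[toy.notnull().all(axis=1)]

print(complete_a)
print(complete_a.equals(complete_b))
```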

In [86]:
df[~df.isnull().any(axis=1)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
...,...,...,...,...,...,...,...,...,...,...,...,...
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [14]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


**6).  Can you sort passengers according to how much they paid for a ticket?**

To use: `df.sort_values()`

In [88]:
df.sort_values('Fare',ascending = False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0000,C23 C25 C27,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
...,...,...,...,...,...,...,...,...,...,...,...,...
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S
413,414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0000,,S
822,823,0,1,"Reuchlin, Jonkheer. John George",male,38.0,0,0,19972,0.0000,,S
732,733,0,2,"Knight, Mr. Robert J",male,,0,0,239855,0.0000,,S


**7). Can you sort passengers according to how much they paid for a ticket, within each port of embarkation?**  

i.e., sort the rows so that the passengers who embarked from port `C` are listed first, and then within port `C` everyone is sorted by how much they paid for a ticket.

To use: `df.sort_values()`

In [None]:
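# Sort by port first (C, Q, S), then by fare from highest to lowest within each port
df.sort_values(['Embarked', 'Fare'], ascending=[True, False])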
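
Sorting on more than one key can be sketched by passing a list of columns to `sort_values()`, with a matching list of directions in `ascending`. The example below uses an invented five-row table (the `toy` frame and its fares are hypothetical) and assumes the most expensive tickets should come first within each port:

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "C", "Q"],
    "Fare": [8.05, 30.0, 263.0, 512.33, 7.75],
})

# First key: port, alphabetical (C, Q, S).
# Second key: fare, highest first within each port.
ordered = toy.sort_values(["Embarked", "Fare"], ascending=[True, False])

print(ordered)
```

By default, rows with a missing sort key are placed at the end; the `na_position` parameter controls this.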

**8). If people traveled in a group they had the same ticket number.  Can you query your dataframe to return the ticket values that occurred more than once?  i.e., run a line in pandas that returns *a list* of ticket values that occurred more than once, not an entire dataframe.**

To use: there are a few methods you can use, but try `df.duplicated()`, along with `pd.Series.unique()` (note that `unique()` is a *Series* method).  **Hint:** You can test for duplicated values on specific columns.

In [None]:
# your answer here
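
One possible approach, sketched on an invented five-row table rather than the Titanic file (the `toy` frame and its ticket values are hypothetical): flag every row whose ticket appears more than once with `duplicated(keep=False)`, then reduce the flagged tickets to their unique values.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "Ticket": ["A1", "B2", "A1", "C3", "B2"],
})

# keep=False marks ALL occurrences of a repeated ticket, not just the later ones
dup_mask = toy.duplicated(subset="Ticket", keep=False)

# Collapse the flagged tickets to one entry each and convert to a plain list
group_tickets = toy.loc[dup_mask, "Ticket"].unique().tolist()

print(group_tickets)  # ['A1', 'B2']
```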

**9). See if you can query a dataframe so that it only returns rows with passengers that are traveling in groups, based on their ticket numbers.**

To use: `df.isin()`, assuming you used the approach suggested in the previous question.

In [None]:
# your answer here
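
A sketch of how `isin()` could be combined with a list of repeated tickets, again on an invented table (the `toy` frame and the `group_tickets` name are hypothetical, introduced only for this example):

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "Ticket": ["A1", "B2", "A1", "C3", "B2"],
})

# Tickets that occur more than once
dup_mask = toy.duplicated(subset="Ticket", keep=False)
group_tickets = toy.loc[dup_mask, "Ticket"].unique().tolist()

# Keep only the rows whose ticket is in that list
group_rows = toy[toy["Ticket"].isin(group_tickets)]

print(group_rows)
```

In this small case `toy[dup_mask]` would return the same rows; `isin()` becomes more useful when the list of values comes from somewhere else.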

**10).  Can you select only the columns that are text-based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [None]:
# your answer here
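
A minimal sketch, assuming text columns are stored with the `object` dtype, as the listings earlier in this lab suggest. Newer pandas versions may use a dedicated string dtype instead, so the toy columns below are created as `object` explicitly. The `toy` frame is invented for illustration:

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file.
# Text columns are forced to object dtype so the example behaves the
# same regardless of the pandas version's default string handling.
toy = pd.DataFrame({
    "Name": pd.Series(["Ann", "Bob"], dtype="object"),
    "Age": [22.0, 38.0],
    "Ticket": pd.Series(["A1", "B2"], dtype="object"),
    "Fare": [7.25, 71.28],
})

text_cols = toy.select_dtypes(include="object")

# .columns is an attribute (no parentheses), holding the column labels
print(list(text_cols.columns))  # ['Name', 'Ticket']
```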

**11).  Can you select only the columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [None]:
# your answer here
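
One way this might look, using the `"number"` selector, which covers both integer and float columns. The `toy` frame is invented for illustration:

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Pclass": [3, 1],
    "Fare": [7.25, 71.28],
})

numeric_cols = toy.select_dtypes(include="number")

print(list(numeric_cols.columns))  # ['Pclass', 'Fare']
```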

**12). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 11.

In [None]:
# your answer here
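
A possible sketch, on an invented three-row table (the `toy` frame and its values are hypothetical). Passing a Series of column means to `fillna()` fills each column with its own mean, because the Series is matched to the columns by label:

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [20.0, None, 40.0],
    "Fare": [10.0, 30.0, None],
})

# Labels of the numeric columns only
numeric_cols = toy.select_dtypes(include="number").columns

# Work on a copy so the original frame is left untouched
filled = toy.copy()
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(filled)
```

`mean()` skips missing values by default, so each fill value is the average of the values that are present.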