
# Filtering and splitting data
---



A dataframe contains rows and columns.  Most operations need only specific columns and specific rows, so this worksheet focuses on isolating the rows and columns you want to work with.  You will select columns, select rows by position (`head`, `tail`, `iloc`) and filter rows by given criteria.

To start the first set of exercises on this sheet, read the Titanic data set in the CSV file at this URL: https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

You will need to run this cell each time you come back to this worksheet.

Read the data from the file into a dataframe called **titanic**.

In [11]:
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url)
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Splitting across columns and rows
---

For further reference:  [How do I select a subset of a Dataframe - Pandas Getting Started Tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html)
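
As a quick illustration of the selection tools used in this section, here is a minimal sketch on a small made-up dataframe (the `people` table and its values are invented for the example and are not part of the Titanic data). It shows single versus double square brackets for columns, and `head`, `tail` and `iloc` for rows.

```python
import pandas as pd

# Toy data standing in for a real dataset (values are made up).
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve", "Fay"],
    "Age": [31, 45, 27, 52, 19, 38],
    "Pclass": [1, 3, 2, 3, 1, 2],
})

one_column = people["Name"]             # single brackets give a Series
two_columns = people[["Name", "Age"]]   # a list inside the brackets gives a DataFrame

first_two = people.head(2)              # first 2 rows
last_two = people.tail(2)               # last 2 rows
middle = people.iloc[2:4]               # rows at positions 2 and 3 (4 is excluded)
cell = people.iloc[0, 1]                # row position 0, column position 1

print(two_columns.shape)
print(middle)
```

Note that `iloc` works on positions, so `people.iloc[2:4]` returns two rows regardless of what the index labels are.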

### Exercise 1 - create a dataframe containing a subset of columns

Create a new dataframe called **survival** which contains just the `Name`, `Sex`, `Age` and `Survived` columns.  Display the first 5 rows of the new dataframe.

(*Reminder: use [ ] to specify a column or a set of columns.  Where there is a set of columns, these should be included in a list inside the main square brackets e.g. df[ [ item1, item2, item3 ] ] so that there is only ever ONE item in the outer brackets*)

**Test output**:  
	Name	Sex	Age	Survived  
0	Braund, Mr. Owen Harris	male	22.0	0  
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1  
2	Heikkinen, Miss. Laina	female	26.0	1  
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1  
4	Allen, Mr. William Henry	male	35.0	0  

In [None]:
survival = titanic[['Name','Sex','Age','Survived']]
survival.head()
  

### Exercise 2 - and another subset of columns

Create a new dataframe called **fares** which contains the columns `Pclass`, `Cabin`, `Ticket` and `Fare`.  Display the final 8 rows.

**Test output**:   
	Pclass	Cabin	Ticket	Fare  
883	2	NaN	C.A./SOTON 34068	10.500  
884	3	NaN	SOTON/OQ 392076	7.050  
885	3	NaN	382652	29.125  
886	2	NaN	211536	13.000  
887	1	B42	112053	30.000  
888	3	NaN	W./C. 6607	23.450  
889	1	C148	111369	30.000  
890	3	NaN	370376	7.750  


In [None]:
fares = titanic[['Pclass','Cabin','Ticket','Fare']]
fares.tail(8)

## Filtering rows according to given criteria
---

To select records according to given criteria, specify the criteria in the [ ] after the dataframe.  A criterion is built with a comparison operator (e.g. ==, <, >, !=).  Where there is a set of criteria, enclose each criterion in brackets ( ) and combine them with logical symbols: & (and), | (or) and ~ (not).
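
The sketch below shows how such a filter is built, using a small made-up dataframe (the `people` table is invented for illustration). A comparison on a column produces a Series of True/False values, and placing that Series in the [ ] keeps only the rows where it is True. In pandas, "not" is written with `~`.

```python
import pandas as pd

# Toy data (made up), not the Titanic file.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "Sex": ["female", "male", "male", "female", "female"],
    "Age": [31, 45, 27, 52, 19],
})

# A comparison on a column gives a Series of True/False values (a mask).
is_female = people["Sex"] == "female"

# Putting the mask in [ ] keeps only the rows where it is True.
females = people[is_female]

# Combine criteria with & (and), | (or) and ~ (not); wrap each one in ( ).
female_under_40 = people[(people["Sex"] == "female") & (people["Age"] < 40)]
under_20_or_over_50 = people[(people["Age"] < 20) | (people["Age"] > 50)]
not_female = people[~(people["Sex"] == "female")]

print(female_under_40)
```

The brackets around each criterion matter because `&` and `|` bind more tightly than `==` and `<` in Python.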

**Example 1**
The following will create a new dataframe called survivors which contains only the records of those who survived the sinking.  

`survivors = titanic[titanic['Survived'] == 1]`

The first five records of the `survivors` dataframe will be passengers with the ids 2, 3, 4, 9 and 10 and the shape of `survivors` will be `(342, 12)`.

**Example 2**
The following will create a new dataframe called **young_females** which contains only the records of women under the age of 30 (whether or not they survived the sinking).  

`young_females = titanic[(titanic['Sex'] == 'female') & (titanic['Age'] < 30)]`

The last five records of the `young_females` dataframe will be passengers with the ids 875, 876, 881, 883 and 888 and the shape of `young_females` will be `(147, 12)`.

Try these below


In [None]:
young_females = titanic[(titanic['Sex'] == 'female') & (titanic['Age'] < 30)]
print(young_females.shape)
young_females.tail()

### Exercise 3 - find the third class passengers
---

Create a new dataframe called **third_class_passengers** which contains only the records for passengers who travelled in passenger class 3.  Display the first 20 records

**Test output**:  
shape = (491, 12)  
PassengerIds - 1,3,5,6,8,9,11,13,14,15,17,19,20,23,25,26,27,29,30,33  



### Exercise 4 - female 1st class passengers who survived
---

Create a new dataframe called **female_1st_class_survivors** which contains only the records for female passengers who travelled in passenger class 1 and who survived.  Display the last 10 records

**Test output**:  
(91, 12)  
PassengerIds - 830, 836, 843, 850, 854, 857, 863, 872, 880, 888  


## Filtered and split
---

When selecting on criteria and by column, specify the criteria first, then specify the columns.  Example - display the name and passenger class for female passengers under the age of 30:  

`young_females = titanic[(titanic['Sex'] == 'female') & (titanic['Age'] < 30)][['Name','Pclass']]`
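
A minimal sketch of the "criteria first, then columns" pattern on a made-up dataframe (the `people` table is hypothetical). It also shows `.loc`, which is an alternative way of doing both steps in one call; it is offered here as an option, not as the method this worksheet requires.

```python
import pandas as pd

# Toy data (made up), not the Titanic file.
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "Sex": ["female", "male", "male", "female", "female"],
    "Age": [31, 45, 27, 22, 19],
    "Pclass": [1, 3, 2, 3, 1],
})

# Step 1: build the row criteria.
mask = (people["Sex"] == "female") & (people["Age"] < 30)

# Step 2: filter the rows, then pick the columns.
chained = people[mask][["Name", "Pclass"]]

# Alternative: .loc takes the row criteria and the column list together.
with_loc = people.loc[mask, ["Name", "Pclass"]]

print(chained)
print(chained.equals(with_loc))
```

Both forms return the same rows and columns here; `.loc` tends to be the safer choice if you later want to assign values to the selection.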

---



### Exercise 5 - name and passenger id for passengers who embarked at port C
---

Create a new dataframe called **port_embarkation_list** which contains only the records for passengers who embarked at port C.  Display the `Name` and `PassengerId` only for all records

**Test output**:  
(168, 2)  
Row index values shown - 1,9,19,26,30, ... 866, 874, 875, 879, 889  


### Exercise 6 - passenger id and age for all surviving passengers over 50
---

Create a dataframe called **older_survivors** which contains only the records for passengers who survived and who are older than 50.  Display the `PassengerId` and `Age` only for the last 15 records.  

**Test output**:  
shape = (22, 2)  
Row index values = 11, 15, 195, 268, 275, 366, 449, 483, 496, 513, 570, 571, 587, 591, 630, 647, 765, 774, 820, 829, 857, 879  


### Exercise 7 - display the Name and Age of the first male, 2nd class passenger who embarked at port Q
---


Create a dataframe called **male_2nd_Q** which contains only the records for passengers who embarked at port Q, travelled second class and were male.  Display the `Name` and `Age` of the first (and only) passenger in this list.

**Test output**:  
shape = (1, 2)  
Name = Kirkland, Rev. Charles Leonard  
Age = 57.0  

### Exercise 8 - summarise data on who survived in each passenger class
---


Create three dataframes, one for each of the three passenger classes, each holding the records for the passengers in that class who survived.  Display the description (`describe()`) of the `PassengerId` column for each dataframe.  

*To print the description as a string use:*  
```
print('First class survivors') 
print(first_class.describe().to_string())
```
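
For orientation, here is a small sketch of what `describe()` produces, using a made-up Series (the `ages` values are invented and not taken from the Titanic file). For numeric data it reports the count, mean, standard deviation, minimum, quartiles and maximum.

```python
import pandas as pd

# Made-up values, not taken from the Titanic file.
ages = pd.Series([10, 20, 30], name="Age")

summary = ages.describe()      # count, mean, std, min, quartiles, max
print(summary.to_string())     # plain-text version of the summary
```

Individual statistics can be read from the result by label, for example `summary["mean"]`.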

**Test output**:  
first_class =  count 136.000000 ...  
second_class =  count 87.000000 ...  
third_class =  count 119.000000 ...  

### Exercise 9 - summarise data on young males who survived and all males
---

Create two dataframes: one to hold the records for all male passengers and one to hold the records for all young males who survived.  Display the description of each, including `Age`, for all records in each.  

**Test output**:  
all_males = count 453.000000 mean 30.726645 ...  
young_males = count 93.000000 mean 27.276022 ...


### Exercise 10 - challenge

Create a set of code cells each with some code that shows the means or the counts for interesting data sets. Add a text cell before each set of code cells to explain what you are showing in the following set of cells. 

*To do this you can add the function to the end of the selection code as shown below*:  
```
# Mean age of all male passengers (the filter selects on Sex only)
males_avg_age = titanic[(titanic['Sex'] == 'male')]['Age'].mean()
print(males_avg_age)
```

An example might be to select passengers who embarked at port C and who paid a fare over 50, then count the `PassengerId` and `Cabin` values for these passengers (*you will apply the count to the selected columns rather than the whole data table*)
```
embarked_passengers = titanic[(titanic['Embarked'] == 'C')&(titanic['Fare'] > 50)][['PassengerId','Cabin']].count()
print(embarked_passengers)
```
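
One detail worth knowing when reading such counts: `count()` only counts values that are present, so a column with missing entries (as `Cabin` has) will give a smaller number than a complete column. The sketch below demonstrates this on a made-up dataframe (the `people` table and its values are hypothetical).

```python
import pandas as pd

# Toy data (made up); None marks a missing cabin.
people = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Fare": [80.0, 60.0, 10.0, 95.0],
    "Cabin": ["C85", None, "E46", None],
})

# Select rows by criteria, then columns, then apply the function.
high_fare = people[people["Fare"] > 50][["PassengerId", "Cabin"]]
counts = high_fare.count()     # missing values are not counted

# The same pattern with a single column and mean().
mean_fare = people[people["Fare"] > 50]["Fare"].mean()

print(counts)
print(mean_fare)
```

Here three passengers paid over 50, but only one of them has a cabin recorded, so the two counts differ.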

