# Lab 3: Laboratory Notes - Week 3: Exploratory analysis

Last week you were asked to explore Open Data related to your campus location, as well as the <span style="color:red">northwind.db</span> database.  You may not have done much SQL if you have not taken F28DM (and some of you will NOT be taking that course).  Do keep the database, though, as you can use it as supplementary material for learning SQL.  This week's laboratory work looks shorter, but it is meant for you to explore as you go through it.  You will encounter errors and try out various ways of handling <span style="color:red">pandas</span> DataFrames.  We start by exploring a data set.

## Initial Data Auditing

We'll now check various characteristics of the data. First let's have a look at the dimensions of the table:

<span style="color:red">import pandas as pd  
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"  
titanic = pd.read_csv(url)  
print(titanic.shape)</span>  

How many rows and columns are there?

We can investigate the data type of each column by accessing the <span style="color:red">dtypes</span> attribute of the DataFrame:

<span style="color:red">titanic.dtypes</span>

Once we have some idea, an important numeric audit is the 5 number summary.

## 5 Number Summary

A "5 number summary" is used for descriptive analyses of large data sets, or as a preliminary investigation.  This is typically the step after ensuring that you have read the data correctly (using <span style="color:red">head()</span> or <span style="color:red">tail()</span>).  The 5 values consist of:

* maximum
* minimum
* median
* lower (or first) quartile (25%)
* upper (or third) quartile (75%)

For those who are well versed with boxplots, the idea is similar.  In addition to the 5 number summary, many descriptive analyses also report the mean.  Do note that this applies only to numeric data; categorical data need a different descriptive analysis. We use the variable name df here as a general reference to a DataFrame. Substitute it with the name of your own DataFrame.

df.describe()

For categorical data, the 5 number summary (and the mean) does not mean much.  Instead, you may want to look at the count of values, the number of unique values, the most frequent value (top) and how often it occurs (freq).  For this, you tell the describe() function to include the object columns.

df.describe(include='object')
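To see the two kinds of summary side by side, here is a small sketch. The table is made up for illustration (it is not the Titanic file), and the text column is stored with the object dtype on purpose so that it matches what the notes describe.

```python
import pandas as pd

# Made-up sample data (an assumption for illustration, not the Titanic file)
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 35],
    # dtype=object so the column is treated as an 'object' column
    "sex": pd.Series(["male", "female", "female", "female", "male"], dtype=object),
})

numeric_summary = df.describe()                  # numeric columns only
object_summary = df.describe(include="object")   # object (text) columns only

# The 5 number summary sits in these rows of the numeric summary
print(numeric_summary.loc[["min", "25%", "50%", "75%", "max"], "age"])

# count, unique, top and freq for the text column
print(object_summary["sex"])
```

Do try adding another numeric or text column and see which of the two summaries it appears in.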

With head(), tail() and describe(), you have the basics to get a feel for the data.  (You should be asking why .shape and .dtypes do not have parentheses.  Why?)

We will revisit this next week.

## Introduction to Slicing

We have now read some data.  Let's go back to the contents of <span style="color:red">filename.csv</span>.

<span style="color:red">import pandas as pd  
data = pd.read_csv("filename.csv")  
data</span>  

You should have 9 rows (0 to 8), with several columns. Note that in the R programming language, rows are referred to as observations and columns are referred to as variables.  You can get the column names using:

<span style="color:red">data.columns</span>  

But a neater way is to inform it that you want it in a list format:

<span style="color:red">list(data.columns)</span>  

You can then slice the DataFrame (table) by label using the <span style="color:red">.loc[]</span> indexer or by numeric position using the <span style="color:red">.iloc[]</span> indexer.

<span style="color:red">data.loc[:,"ages"] # All rows, column "ages"  
data.iloc[:3,2] # Rows 0 to 3 (exclusive) and 3rd column (1st is indexed by 0)</span>

<span style="color:red">What!?</span> If you treat the table like a matrix, the values inside the [ ] indicate the rows and columns, separated by the '<span style="color:red">,</span>'.   The '<span style="color:red">:</span>' specifies a range. If you want a specific row, you can simply give its index; the column part is optional.

<span style="color:red">data.loc[1]</span>
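The difference between selecting by label and selecting by position is easier to see when the row labels are not 0, 1, 2. The sketch below uses a made-up table with row labels 10, 20 and 30; the names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up table whose row labels differ from the row positions
data = pd.DataFrame(
    {"names": ["Ann", "Bob", "Cat"], "ages": [31, 45, 27]},
    index=[10, 20, 30],
)

by_label = data.loc[20, "ages"]   # row labelled 20, column named "ages"
by_position = data.iloc[1, 1]     # second row, second column (counting from 0)
first_two = data.iloc[:2, 1]      # positions 0 and 1 of the second column

print(by_label, by_position)
print(list(first_two))
```

Here both lookups reach the same cell, but only because label 20 happens to sit at position 1.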

Do try different ways of slicing the DataFrame (we use this term for obtaining a subset).  You can also select a column by its name without <span style="color:red">.loc[]</span> or <span style="color:red">.iloc[]</span> (as per last week's session).

<span style="color:red">X = data["ages"]  
print(X)</span>  

and try

<span style="color:red">Y = data[["ages"]]  
print(Y)</span>  

Do you notice the difference?  Check the <span style="color:red">type()</span> of <span style="color:red">X</span> and <span style="color:red">Y</span>.  You will notice that one is a Series and the other a DataFrame.  You can think of a Series as a vector and a DataFrame as a matrix.
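If you would like to check the vector versus matrix idea directly, the shape of each object is a quick test. This is a minimal sketch on a made-up table, not on filename.csv.

```python
import pandas as pd

# Made-up table for illustration
data = pd.DataFrame({"names": ["Ann", "Bob", "Cat"], "ages": [31, 45, 27]})

X = data["ages"]     # single brackets: a Series (one dimension)
Y = data[["ages"]]   # double brackets: a DataFrame (two dimensions)

print(type(X).__name__, X.shape)
print(type(Y).__name__, Y.shape)
```

The Series has a shape with one entry, while the DataFrame has rows and columns even though it holds a single column.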

Let's go through the following and do use this opportunity to familiarise yourself further with the Jupyter Notebook and slicing.

<span style="color:red">import pandas as pd  
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"  
df = pd.read_csv(url)</span>  

Select a column by using its column name:

<span style="color:red">df['name']</span>  

What's the error message?  What does "<span style="color:red">KeyError: 'name'</span>" mean?  In this course and in using Python, you will encounter many different error messages.  Do take some time to read through the error messages, think about them and familiarise yourself with them.  Do ask if you are unsure.  In this case, the error arises because column names are case sensitive: it should be '<span style="color:red">Name</span>' and not '<span style="color:red">name</span>'.
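One way to investigate a KeyError is to list the column names and test whether the name you typed is among them. The sketch below does this on a made-up two-row table; the try/except part is only one possible way to handle the error, and the message text is our own.

```python
import pandas as pd

# Made-up table with the same capitalised column names as the Titanic data
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Survived": [1, 0]})

print(list(df.columns))        # the exact spellings pandas knows about
print("name" in df.columns)    # lower-case spelling is not a column
print("Name" in df.columns)    # capitalised spelling is

try:
    df["name"]
except KeyError as err:
    # err holds the key that could not be found
    message = f"No column called {err}; available columns: {list(df.columns)}"
    print(message)
```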

Let's proceed with selecting multiple columns using a list of column names:

<span style="color:red">df[['Name', 'Survived']]</span>  

![P1](picture/P1.png)

You will notice that it did not show rows 5 to rows 885.  Select a value using the column name and row index:

<span style="color:red">df['Name'][884]</span>  

Or a range of rows:

<span style="color:red">df['Name'][880:884]</span>    

To get multiple columns and multiple rows, you can do:

<span style="color:red">df[['Name','Survived']][880:884]</span>  

(Note the double <span style="color:red">[ ]</span>.)  Alternatively, you can use the <span style="color:red">.loc[]</span> indexer:

<span style="color:red">df.loc[880:884,['Name','Survived']]</span>  

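Is there a difference?  Yes: <span style="color:red">.loc[]</span> slices by row label and includes the end label (884), whereas the plain <span style="color:red">[880:884]</span> slice works by position and excludes the end, so do be careful with this.  Let's go back and look at the <span style="color:red">.loc[]</span> indexer.  Select a particular row from the table: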

<span style="color:red">df.loc[2]</span>  

Select all rows with a particular value in one of the columns:

<span style="color:red">df.loc[df['Age'] <= 6]</span>  

The last statement puts a condition inside the outer [ ]: it selects all rows where the 'Age' column is less than or equal to 6.
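Two ideas from this part, the end of a slice and the condition inside the [ ], can be checked on a table small enough to read in full. The sketch below uses made-up names and ages, not the Titanic data.

```python
import pandas as pd

# Made-up table with the default row labels 0 to 4
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
    "Age": [4, 30, 6, 52, 12],
})

positional = df["Name"][1:3]     # plain slice: positions 1 and 2, end excluded
by_label = df.loc[1:3, "Name"]   # .loc slice: labels 1, 2 and 3, end included

mask = df["Age"] <= 6            # a Series of True/False values, one per row
young = df.loc[mask]             # keeps only the rows where the mask is True

print(list(positional))
print(list(by_label))
print(list(mask))
print(young)
```

Printing the mask on its own is a useful habit: it shows exactly which rows the condition will keep.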

## Writing (Saving) the Data

We have seen how we can read data from various sources, but what about saving our pre-processed data?  Let us assume that you would only like to analyse a part of the data and you want to save the resulting DataFrame to a CSV file.  We already have the <span style="color:red">df</span> DataFrame, so we will create a new DataFrame called <span style="color:red">df2</span> to store Titanic passengers that are aged 12 or older.  Here we assign the result from <span style="color:red">df.loc[df['Age'] >= 12]</span> to <span style="color:red">df2</span>.

<span style="color:red">df2 = df.loc[df['Age'] >= 12]</span>  

Select a location where you want to save the file.  I will use my own path, but it will not exist on your computer (unless you have the same name as me).

<span style="color:red">df2.to_csv(r'C:\Users\Ian\Desktop\output.csv', index = None, header=True)</span>  

For Linux and MacOS, the path will not have a "<span style="color:red">C:</span>" drive and also the separator is not a '<span style="color:red">\ </span>' (pronounced backslash) but a forward/normal slash ('<span style="color:red">/</span>').  In short, it should read (assuming that I have a MacOS):

<span style="color:red">df2.to_csv(r'/Users/Ian/Desktop/output.csv', index = None, header=True)</span>  

A few notes:

* The function used is part of the DataFrame, and it is called <span style="color:red">to_csv()</span>
* The <span style="color:red">r</span> is to inform the kernel (the executing Python engine) that the backslash '<span style="color:red">\ </span>' is not an escape character.  Read more [here](https://docs.python.org/3/tutorial/introduction.html#text).
* The parameters '<span style="color:red">index</span>' and '<span style="color:red">header</span>' specify what is written to the saved file: the row index and the column names, respectively.  You can refer to the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
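If you would rather not type the path separator yourself, Python's pathlib module builds the path for the operating system you are on. The sketch below is one possible approach, not the one used in these notes: it writes a made-up DataFrame to a temporary folder (an assumption, so that nothing is left on your Desktop) and reads it back to confirm what was saved.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for df2
df2 = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [30.0, 52.0]})

with tempfile.TemporaryDirectory() as folder:
    # Path joins folder and file name with the correct separator for your system
    out_path = Path(folder) / "output.csv"
    # index=False leaves out the row index; header=True keeps the column names
    df2.to_csv(out_path, index=False, header=True)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Do try index=True instead and read the file back; you should see an extra unnamed column holding the old row index.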

Go check your folder to see if the file is saved correctly. Note that you can open the CSV file using a text editor (in Windows, you can use Notepad) or in MS-Excel.  You should get something like the picture below; note that rows 6, 8, 11, etc. are missing because they do not meet the condition of >= 12 for the Age.

![P2](picture/P2.png)

Do try with different options for the parameters, the slicing (or subsetting) of the data.

#### Exercise 3.1: 

Can you get the number of male and female passengers who are above 18 and below 60?  This leads us to the concept of aggregating.

## Aggregation and Introduction to groupby

Continuing with the titanic data set, let's now focus on the age of passengers. We've already seen that we can calculate the average value of a column using the <span style="color:red">describe()</span> function. Alternatively, we could make use of the "<span style="color:red">mean()</span>" function directly, by typing:

<span style="color:red">df['Age'].mean()</span>

There are many other aggregation functions we might wish to use to investigate the age of the passengers. For example:

<span style="color:red">df['Age'].max() # oldest  
df['Age'].min() # youngest  
df['Age'].sum() # total years  
df['Age'].std() # standard deviation  
df['Age'].median() # half of the people were older (/younger)  
df['Age'].mode() # most common age</span>  
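If a column contains missing values (NaN), these aggregation functions skip them by default. The sketch below shows this on a made-up Series of four ages, one of which is missing.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

n_known = ages.count()                 # counts only the non-missing values
mean_age = ages.mean()                 # NaN is skipped by default
total_age = ages.sum()                 # NaN is skipped here as well
strict_mean = ages.mean(skipna=False)  # NaN is kept, so the result is NaN

print(n_known, mean_age, total_age, strict_mean)
```

This matters when you interpret an average: it describes only the rows where the value is known.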

Knowing the average age over all passengers is interesting, but knowing the average age for different groups of passengers may be more useful. We could then answer questions like:

* How old were the first-class passengers on average?
* Was the average age the same for men and women?

Let's now have a more detailed look at the age of different groups. We can do that simply by making use of the groupby() command. We'll use the command to create a (hierarchical) index on the 'Sex' and 'Pclass' values, so that we can then compute aggregate values (the mean) over each of the groups:

sex_class = df.groupby(['Sex','Pclass'])['Age']
sex_class.mean()
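The result of grouping by two columns is a Series with a hierarchical index, so a single group is looked up with a pair of labels. Here is a minimal sketch on made-up data with the same column names as the Titanic table.

```python
import pandas as pd

# Made-up passengers for illustration
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Age": [30.0, 20.0, 40.0, 22.0, 28.0],
})

mean_age = df.groupby(["Sex", "Pclass"])["Age"].mean()

male_third = mean_age.loc[("male", 3)]   # one group, addressed by (Sex, Pclass)
as_table = mean_age.unstack()            # Sex as rows, Pclass as columns

print(mean_age)
print(male_third)
print(as_table)
```

The unstack() call is optional; it simply lays the same numbers out as a table, which can be easier to compare by eye.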

As part of the data auditing, it would be interesting to interpret the output, e.g. what did you notice about the average age of the Titanic’s passengers with regard to their classes? How about the relationship between age and gender?

#### Exercise 3.2: 

Use other aggregation functions and groupings to determine which class had the oldest and youngest passengers and which gender had the largest amount of variation in age (standard deviation).

## Advanced Aggregation

Let's start afresh again.  This time we use the variable name '<span style="color:red">titanic</span>' instead of '<span style="color:red">df</span>' just to illustrate that it is just a name and we can use almost any name (as long as it is not a reserved keyword - the editor will tell you this by changing the name to a blue bold font if it is a keyword).

<span style="color:red">import pandas as pd # You don't need to do this again actually but included here for completeness  
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"  
titanic = pd.read_csv(url)  
titanic.head()</span>  

## Multiple data aggregation operations

Oftentimes we'd like to compute multiple aggregation operations at the same time. Here is an example where we compute statistics on multiple columns at once. (Note that you could also use the method to compute different aggregation functions on the same column of data.) We specify the set of aggregation operations we wish to perform by detailing:

* the columns the aggregation should be applied to (in this case the two columns are 'PassengerId' and 'Age'), followed by ':' as the separator, and
* the aggregation operation to apply in each case ('count' and 'mean'):

<span style="color:red">'PassengerId':'count'</span> and <span style="color:red">'Age':'mean'</span>

Let's put this into a dictionary stored in a temporary variable called '<span style="color:red">fun</span>'

<span style="color:red">fun = {'PassengerId':'count','Age':'mean'}</span>

Note that the variable '<span style="color:red">fun</span>' can be anything you want to call it. So now we've defined an aggregation operation that both counts the number of passengers and computes their average age. Let's apply that operation to the rows in the titanic table, grouping them by passenger '<span style="color:red">Pclass</span>'. To apply the operation we use the '<span style="color:red">agg()</span>' function:

<span style="color:red">groupbyClass = titanic.groupby('Pclass').agg(fun)  
groupbyClass</span>

The displayed table doesn't seem to have sufficient headers to indicate that the numbers are the count and the mean of the respective columns. 

What we need are names for the resulting new columns (we'll call them '# passengers' and 'avg age'):

<span style="color:red">groupbyClass = titanic.groupby('Pclass').agg(fun).rename(columns={'PassengerId': '# passengers', 'Age': 'avg age'})  
groupbyClass</span>
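As an alternative to aggregating and then renaming, pandas also supports "named aggregation", where the new column name is given together with the column and the operation. This is a sketch of that alternative, not the approach used in these notes; the table and the names n_passengers and avg_age are made up.

```python
import pandas as pd

# Made-up sample; one age is missing on purpose
titanic_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 2, 3, 3],
    "Age": [40.0, 50.0, 30.0, 20.0, None],
})

# new_column_name=(source column, operation)
summary = titanic_sample.groupby("Pclass").agg(
    n_passengers=("PassengerId", "count"),
    avg_age=("Age", "mean"),
)

print(summary)
```

Note that counting 'PassengerId' counts every passenger in the class, including the one whose age is missing, while the mean of 'Age' uses only the known ages.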

Have a look at the output, which has now been grouped by passenger class. Which class had the most passengers and which one had the oldest passengers on average?

#### Exercise 3.3:

Modify the aggregation operation '<span style="color:red">fun</span>' above so that it also finds the age of the oldest and youngest passengers in each class. Note that all aggregate operations being applied to the same column need to be placed within the same set of curly braces '{}' and separated by commas ','. So fill in the MISSING parts of the function below:

<span style="color:red">fun2 = {'PassengerId':'count','Age':{'mean', [MISSING], [MISSING]}}  
groupbyClass2 = titanic.groupby('Pclass').agg(fun2)  
groupbyClass2</span>  

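Do note the effect of the curly braces <span style="color:red">{ }</span>: they create a Python set, which is unordered, so the resulting columns may not appear in the order you typed them. So how old was the oldest passenger traveling in the 'first', 'second' or 'third' class?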

In order to turn the output of the <span style="color:red">groupby()</span> operation into a DataFrame that can be further manipulated (into a cleaner tabular format with clear column names), we need to "flatten it" using the '<span style="color:red">reset_index()</span>' and '<span style="color:red">droplevel()</span>' commands. Have a look at the outputs of the following commands one after the other (by printing out the table each time) to see what they produce.

<span style="color:red">groupbyClass2 = groupbyClass2.reset_index()  
"""turn 'class' groups into column values"""  
groupbyClass2.columns = groupbyClass2.columns.droplevel(0)  
"""drop the top level in the column hierarchy"""  
groupbyClass2</span>  

Flattening caused us to lose the column name for the '<span style="color:red">class</span>' attribute. We can rename the column as follows:

<span style="color:red">groupbyClass2.rename(columns = {'':'class'},inplace = True)  
"""rename the first column to be 'class'"""  
groupbyClass2</span>  
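Dropping the top level loses the information about which column each statistic came from. Another option, sketched below on a made-up table, is to join the two levels into one name. The operations are given as a list rather than a set here so that the column order is predictable; both choices are ours and not part of the exercise.

```python
import pandas as pd

# Made-up sample for illustration
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [1, 1, 2, 2],
    "Age": [40.0, 50.0, 30.0, 20.0],
})

# A list of operations keeps the columns in the order written
grouped = sample.groupby("Pclass").agg(
    {"PassengerId": "count", "Age": ["mean", "max", "min"]}
)

flat = grouped.reset_index()
# Each column is a pair such as ('Age', 'mean'); join the pair with '_'
# and strip the trailing '_' left by pairs like ('Pclass', '')
flat.columns = ["_".join(col).rstrip("_") for col in flat.columns]

print(flat)
```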

## Custom aggregation operations

There are many built-in functions in pandas that can be used to aggregate data over columns. For example the '<span style="color:red">nunique</span>' function will count the number of unique values in a column. Sometimes the function we need isn't available, however, because what we are after is too specific. For example, if we have a list of values, we might wish to count only those elements in the list with value above a certain threshold. Using the '<span style="color:red">for</span>' syntax in Python we can write an expression to count the elements as follows:

<span style="color:red">my_list = (80,20,64,19,56,12,88)  
sum(e>50 for e in my_list)</span> 

The expression checks, for each element '<span style="color:red">e</span>' in '<span style="color:red">my_list</span>', whether the value is greater than 50 or not. The <span style="color:red">sum()</span> function then counts the number of times the greater-than expression returns <span style="color:red">True</span> (which counts as the value 1). Now that we have a piece of code that can count the number of values that fit a condition, we'd like to use it in an aggregation operation over a column of a DataFrame. We can do that using an anonymous function (called a lambda function) in Python. The syntax to create an anonymous function is to write '<span style="color:red">lambda x:</span>' followed by the function itself, where '<span style="color:red">x</span>' is the name of the variable that appears in the function.  You can read more about it [here](https://www.w3schools.com/python/python_lambda.asp). But it would probably be useful to learn about creating [Python functions](https://www.w3schools.com/python/python_functions.asp).
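To see that a lambda is just a function without a name, the sketch below writes the same counting rule twice, once with def and once with lambda. The function names are our own choice for illustration.

```python
my_list = (80, 20, 64, 19, 56, 12, 88)

def count_over_50(values):
    """Count how many values are greater than 50."""
    return sum(e > 50 for e in values)

# The same rule as an anonymous function; it is stored in a variable here
# only so that the two versions can be compared
count_over_50_lambda = lambda values: sum(e > 50 for e in values)

named_result = count_over_50(my_list)
lambda_result = count_over_50_lambda(my_list)

print(named_result, lambda_result)
```

Both versions give the same count; the lambda form is simply shorter to write inline.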

<span style="color:red">fun3 = {'Age':{'nunique', lambda x: sum(e>50 for e in x)}}</span>

So we have defined a new aggregation operation '<span style="color:red">fun3</span>', which will create two new columns: one (which we will rename to '<span style="color:red">unique age count</span>') that counts the number of distinct values in the '<span style="color:red">Age</span>' column using the function '<span style="color:red">nunique</span>', and another (which we will rename to '<span style="color:red">over 50s count</span>') that counts the number of values in the '<span style="color:red">Age</span>' column that are greater than 50. Again, we can reformat the DataFrame so it's ready for use:

<span style="color:red">groupbyClass3 = titanic.groupby('Pclass').agg(fun3).reset_index()  
groupbyClass3</span>  

![P3](picture/P3.png)

<span style="color:red">groupbyClass3.columns = groupbyClass3.columns.droplevel(0)  
groupbyClass3.rename(columns = {'':'class'},inplace = True)  
groupbyClass3</span>  

![P4](picture/P4.png)

<span style="color:red">groupbyClass3.rename(columns = {'nunique':'unique age count','<lambda_0>':'over 50s count'}, inplace = True)  
groupbyClass3</span>

![P5](picture/P5.png)
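The name '<lambda_0>' is generated by pandas and is easy to mistype when renaming. One way to avoid it, sketched below on made-up data, is to give the custom operation as an ordinary named function and to name the output columns directly. This is an alternative for comparison, and the function name over_50 and the column names are assumptions of ours.

```python
import pandas as pd

# Made-up sample for illustration
sample = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3],
    "Age": [62.0, 55.0, 30.0, 20.0, 20.0],
})

def over_50(ages):
    """Count the ages greater than 50 within one group."""
    return (ages > 50).sum()

# output_name=(source column, operation); no renaming step is needed afterwards
summary = sample.groupby("Pclass").agg(
    unique_age_count=("Age", "nunique"),
    over_50s_count=("Age", over_50),
).reset_index()

print(summary)
```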

#### Exercise 3.4:

Interpret the output and discuss your finding with other students.

## Summary

We can now read, describe, do basic data exploration and save the data.  You should now understand the difference between a Series and a DataFrame.  We have also looked at error messages, and you should by now understand the execution sequencing in Jupyter.  Pause and take some time to review what you have learnt so far, especially the options for slicing the data.  Next week we'll investigate some more sophisticated pandas commands as well as look at graphing data using Python and some wrangling.

## My code part

In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)
print(titanic.shape)


(891, 12)


In [3]:
# dtypes is an attribute (no parentheses); select_dtypes is a separate method for filtering columns by type
print(titanic.dtypes)

In [4]:
print(titanic.head())
print(titanic.tail())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
  

In [5]:
df=titanic.dropna()
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,183.0,183.0,183.0,183.0,183.0,183.0,183.0
mean,455.36612,0.672131,1.191257,35.674426,0.464481,0.47541,78.682469
std,247.052476,0.470725,0.515187,15.643866,0.644159,0.754617,76.347843
min,2.0,0.0,1.0,0.92,0.0,0.0,0.0
25%,263.5,0.0,1.0,24.0,0.0,0.0,29.7
50%,457.0,1.0,1.0,36.0,0.0,0.0,57.0
75%,676.0,1.0,1.0,47.5,1.0,1.0,90.0
max,890.0,1.0,3.0,80.0,3.0,4.0,512.3292


In [6]:
df.describe(include='object')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,183,183,183,183,183
unique,183,2,127,133,3
top,"Cumings, Mrs. John Bradley (Florence Briggs Th...",male,19950,G6,S
freq,1,95,4,4,116


In [8]:
data=pd.read_csv("data/filename.csv")
data.columns

Index(['Unnamed: 0', 'names', 'ages', 'heights'], dtype='object')

In [9]:
list(data.columns)

['Unnamed: 0', 'names', 'ages', 'heights']

In [10]:
data.loc

<pandas.core.indexing._LocIndexer at 0x1aad7f31fe0>

In [11]:
data.iloc()

<pandas.core.indexing._iLocIndexer at 0x1aad7f30f00>

In [12]:
data.loc[:,"ages"] 

0     76
1     82
2    104
3     78
4     23
5     53
6     47
7     88
8     37
Name: ages, dtype: int64

In [13]:
data.iloc[:3,2]

0     76
1     82
2    104
Name: ages, dtype: int64

In [14]:
data.loc[1]

Unnamed: 0       2
names          Ted
ages            82
heights       1.69
Name: 1, dtype: object

In [15]:
x = data["ages"]
print(x)

0     76
1     82
2    104
3     78
4     23
5     53
6     47
7     88
8     37
Name: ages, dtype: int64


In [16]:
y = data[["ages"]]
print(y)

   ages
0    76
1    82
2   104
3    78
4    23
5    53
6    47
7    88
8    37


In [17]:
print(type(x))
print(type(y))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [18]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

In [19]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [20]:
df[['Name', 'Survived']]

Unnamed: 0,Name,Survived
0,"Braund, Mr. Owen Harris",0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,"Heikkinen, Miss. Laina",1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,"Allen, Mr. William Henry",0
...,...,...
886,"Montvila, Rev. Juozas",0
887,"Graham, Miss. Margaret Edith",1
888,"Johnston, Miss. Catherine Helen ""Carrie""",0
889,"Behr, Mr. Karl Howell",1


In [21]:
df['Name'][884]

'Sutehall, Mr. Henry Jr'

In [22]:
df['Name'][880:884]

880    Shelley, Mrs. William (Imanita Parrish Hall)
881                              Markun, Mr. Johann
882                    Dahlberg, Miss. Gerda Ulrika
883                   Banfield, Mr. Frederick James
Name: Name, dtype: object

In [23]:
df[['Name','Survived']][880:884]

Unnamed: 0,Name,Survived
880,"Shelley, Mrs. William (Imanita Parrish Hall)",1
881,"Markun, Mr. Johann",0
882,"Dahlberg, Miss. Gerda Ulrika",0
883,"Banfield, Mr. Frederick James",0


In [24]:
df.loc[880:884,['Name','Survived']]

Unnamed: 0,Name,Survived
880,"Shelley, Mrs. William (Imanita Parrish Hall)",1
881,"Markun, Mr. Johann",0
882,"Dahlberg, Miss. Gerda Ulrika",0
883,"Banfield, Mr. Frederick James",0
884,"Sutehall, Mr. Henry Jr",0


In [25]:
df.loc[2]

PassengerId                         3
Survived                            1
Pclass                              3
Name           Heikkinen, Miss. Laina
Sex                            female
Age                              26.0
SibSp                               0
Parch                               0
Ticket               STON/O2. 3101282
Fare                            7.925
Cabin                             NaN
Embarked                            S
Name: 2, dtype: object

In [26]:
df.loc[df['Age'] <= 6]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
43,44,1,2,"Laroche, Miss. Simonne Marie Anne Andree",female,3.0,1,2,SC/Paris 2123,41.5792,,C
58,59,1,2,"West, Miss. Constance Mirium",female,5.0,1,2,C.A. 34651,27.75,,S
63,64,0,3,"Skoog, Master. Harald",male,4.0,3,2,347088,27.9,,S
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S
119,120,0,3,"Andersson, Miss. Ellis Anna Maria",female,2.0,4,2,347082,31.275,,S
164,165,0,3,"Panula, Master. Eino Viljami",male,1.0,4,1,3101295,39.6875,,S
171,172,0,3,"Rice, Master. Arthur",male,4.0,4,1,382652,29.125,,Q


In [27]:
df2 = df.loc[df['Age'] >= 12]

#### Exercise 3.1:

In [51]:
# Keep passengers aged 18 or over and under 60 (18 <= Age < 60)
filtered_df = df[(df['Age'] >= 18) & (df['Age'] < 60)]

# Count the number of males and females
gender_counts = filtered_df['Sex'].value_counts()

print(gender_counts)

Sex
male      373
female    202
Name: count, dtype: int64


In [28]:
df2.to_csv('new_titanic.csv')

In [29]:
df2.to_csv('new_titanic.csv', index=None, header=True)

In [30]:
df['Age'].mean()

np.float64(29.69911764705882)

In [33]:
df['Age'].max() # oldest

np.float64(80.0)

In [34]:
df['Age'].min() # youngest

np.float64(0.42)

In [35]:
df['Age'].sum() # total years

np.float64(21205.17)

In [36]:
df['Age'].std() # standard deviation

np.float64(14.526497332334042)

In [37]:
df['Age'].median() # half of the people were older (/younger)

np.float64(28.0)

In [38]:
df['Age'].mode() # most common age

0    24.0
Name: Age, dtype: float64

In [39]:
sex_class = df.groupby(['Sex','Pclass'])['Age']
sex_class.mean()

Sex     Pclass
female  1         34.611765
        2         28.722973
        3         21.750000
male    1         41.281386
        2         30.740707
        3         26.507589
Name: Age, dtype: float64

#### Exercise 3.2:

In [52]:
# Drop NaN values in 'Age' to avoid calculation errors
df = df.dropna(subset=['Age'])

# 1. Find the class with the oldest and youngest passengers
age_by_class = df.groupby("Pclass")["Age"].mean()
oldest_class = age_by_class.idxmax()  # Class with highest average age
youngest_class = age_by_class.idxmin()  # Class with lowest average age

# 2. Find the gender with the largest variation in age
std_by_gender = df.groupby("Sex")["Age"].std()
highest_variation_gender = std_by_gender.idxmax()  # Gender with max standard deviation

# Output results
print(f"Class with the oldest passengers (on average): {oldest_class}")
print(f"Class with the youngest passengers (on average): {youngest_class}")
print(f"Gender with the largest age variation: {highest_variation_gender}")

Class with the oldest passengers (on average): 1
Class with the youngest passengers (on average): 3
Gender with the largest age variation: male


In [40]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic = pd.read_csv(url)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [41]:
fun = {'PassengerId':'count','Age':'mean'}

In [42]:
groupbyClass = titanic.groupby('Pclass').agg(fun)
groupbyClass

Pclass,PassengerId,Age
1,216,38.233441
2,184,29.87763
3,491,25.14062


#### Exercise 3.3:

In [43]:
fun2 = {'PassengerId':'count','Age':{'mean', 'max', 'min'}}
groupbyClass2 = titanic.groupby('Pclass').agg(fun2)
groupbyClass2

,PassengerId,Age,Age,Age
Pclass,count,max,min,mean
1,216,80.0,0.92,38.233441
2,184,70.0,0.67,29.87763
3,491,74.0,0.42,25.14062


In [44]:
groupbyClass2 = groupbyClass2.reset_index()
# turn 'class' groups into column values
groupbyClass2.columns = groupbyClass2.columns.droplevel(0)
# drop the top level in the column hierarchy
groupbyClass2

Unnamed: 0,Unnamed: 1,count,max,min,mean
0,1,216,80.0,0.92,38.233441
1,2,184,70.0,0.67,29.87763
2,3,491,74.0,0.42,25.14062


In [45]:
groupbyClass2.rename(columns = {'':'class'},inplace = True)
# rename the first column to be 'class'
groupbyClass2

Unnamed: 0,class,count,max,min,mean
0,1,216,80.0,0.92,38.233441
1,2,184,70.0,0.67,29.87763
2,3,491,74.0,0.42,25.14062


In [46]:
my_list = (80,20,64,19,56,12,88)
sum(e>50 for e in my_list)

4

In [47]:
fun3 = {'Age':{'nunique', lambda x: sum(e>50 for e in x)}}

In [48]:
groupbyClass3 = titanic.groupby('Pclass').agg(fun3).reset_index()
groupbyClass3

,Pclass,Age,Age
,,<lambda_0>,nunique
0,1,39,57
1,2,15,57
2,3,10,68


In [49]:
groupbyClass3.columns = groupbyClass3.columns.droplevel(0)
groupbyClass3.rename(columns = {'':'class'},inplace = True)
groupbyClass3

Unnamed: 0,class,<lambda_0>,nunique
0,1,39,57
1,2,15,57
2,3,10,68


In [50]:
groupbyClass3.rename(columns = {'nunique':'unique age count','<lambda_0>':'over 50s count'}, inplace = True)
groupbyClass3

Unnamed: 0,class,over 50s count,unique age count
0,1,39,57
1,2,15,57
2,3,10,68
