## Question 1 [20 points]

**In no more than 500 words, explain the differences between a Jupyter Notebook (.ipynb extension) and a Python file (.py extension). Focus on the advantages and disadvantages of one or the other, particularly for data analysts/scientists. You are encouraged to use examples and screenshots of the different scripts to support your arguments.**

**Advantages of Jupyter Notebook:**

1. Interactive: Jupyter Notebook provides an interactive environment for data analysis and scientific computing, allowing users to write and run code, visualise data, and document their work all in one place. This makes it easier for users to explore data and test ideas quickly.

2. Cell output: Jupyter Notebook displays each cell's output directly beneath the code that produced it, so users see the results immediately after running a cell. This is particularly useful for data analysis, where users need to see the output of their code to decide how to proceed.

3. Rich media support: Jupyter Notebook supports a wide range of media types, including images, videos, and HTML. This makes it easy for users to create rich, interactive documents that can be shared with others.

4. Easy collaboration: Jupyter Notebook allows users to share their work easily with others, either by sharing the notebook file or by publishing it as a web page. This makes it easier for teams to collaborate on data analysis and scientific computing projects.

**Disadvantages of Jupyter Notebook:**

1. Version control: Jupyter Notebook files are harder to version control with Git or other version control systems. An .ipynb file is a JSON document that stores code, Markdown text, cell outputs, and metadata such as execution counts together, so re-running a cell can change the file even when the code is unchanged. This makes diffs noisy and changes to the code hard to track over time.
2. Debugging: Debugging Jupyter Notebook files can be more difficult than debugging Python files, especially if the code is spread across multiple cells. Cells can be run in any order and share state through the kernel, so a variable may hold a value from a cell that has since been edited or deleted, which makes it harder to trace the flow of data through the code.
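
To illustrate the version-control point, here is a minimal sketch that builds two simplified notebook files as JSON using only Python's standard library. The `make_notebook` helper and the cell contents are hypothetical, and the dictionary only approximates the real .ipynb schema (real notebooks carry more metadata). The code is identical in both versions, yet the saved files differ because the output and the execution count are stored alongside it.

```python
import json


def make_notebook(source, output_text, execution_count):
    """Build a simplified, notebook-like dict (illustrative, not the full schema)."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "metadata": {},
            "outputs": [{"name": "stdout", "output_type": "stream", "text": [output_text]}],
            "source": [source],
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


# Same code, but the cell was re-run later and printed a different value.
first_run = json.dumps(make_notebook("print(total)", "10\n", 1), indent=1)
second_run = json.dumps(make_notebook("print(total)", "12\n", 7), indent=1)

# Compare the two saved files line by line, as a text-based diff would.
changed_lines = [
    (old, new)
    for old, new in zip(first_run.splitlines(), second_run.splitlines())
    if old != new
]
code_unchanged = (
    json.loads(first_run)["cells"][0]["source"]
    == json.loads(second_run)["cells"][0]["source"]
)

print(len(changed_lines), code_unchanged)
```

A plain .py file containing `print(total)` would show no difference at all between the two runs.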

**Advantages of Python files:**

1. Version control: Python files are plain text, so Git and other version control systems produce line-by-line diffs that show exactly which code changed.
2. Debugging: Debugging Python files is usually easier than debugging Jupyter Notebook files, because a script runs from top to bottom starting from a clean state. This is especially true if the code is well-structured and follows best practices.


**Disadvantages of Python files:**

1. Lack of interactivity: Python files are not as interactive as Jupyter Notebook files, which can make it more difficult for users to explore data and test ideas quickly.
2. Lack of inline output: Python files do not display output inline with the code, which can make it harder for users to see the results of their code immediately after running it.

Overall, Jupyter Notebook is more suitable for exploratory data analysis and quick prototyping, while Python files are better for building more structured and scalable projects. Data analysts and scientists often use Jupyter Notebook for exploratory data analysis and Python files for building production-ready models and applications. For example, a data analyst might use Jupyter Notebook to explore a dataset and test different machine learning algorithms, then use Python files to build a production-ready model based on the results of their analysis.

## Question 2	 [10 points]

*What is the difference between a Pandas DataFrame and a Pandas Series? Show an example of how you create each of them.*

In Pandas, a Series is a one-dimensional labelled array, while a DataFrame is a two-dimensional labelled data structure with columns of potentially different types. In other words, a DataFrame can be thought of as a collection of Series that share the same index.

A Pandas Series can be created using the `pd.Series()` constructor, and it can contain any data type, such as integers, strings, floats, or even other objects like lists or dictionaries. Here's an example that builds a Series from a dictionary, whose keys become the index labels:

In [1]:
import pandas as pd
my_series = pd.Series({"pencil case":5,"notebook":9,"eraser":2})
my_series

pencil case    5
notebook       9
eraser         2
dtype: int64

A DataFrame, on the other hand, can be created using the `pd.DataFrame()` constructor, and it can contain multiple columns, each of which can have a different data type. Here's an example:

In [3]:
data = {"items":["pencil case","notebook","eraser"],
        "price":[5,9,2],
        "if_in_store":[True, True, False]}
my_df = pd.DataFrame(data)
my_df

|   | items | price | if_in_store |
|---|---|---|---|
| 0 | pencil case | 5 | True |
| 1 | notebook | 9 | True |
| 2 | eraser | 2 | False |
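
The relationship between the two structures can also be seen by going the other way: taking one column or one row out of a DataFrame gives back a Series. The sketch below reuses the same example data; the variable names `price_column`, `first_row`, and `price_frame` are chosen only for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    "items": ["pencil case", "notebook", "eraser"],
    "price": [5, 9, 2],
    "if_in_store": [True, True, False],
})

price_column = my_df["price"]    # single brackets: one column as a Series
first_row = my_df.loc[0]         # one row is also a Series, labelled by column name
price_frame = my_df[["price"]]   # double brackets: a one-column DataFrame

print(type(price_column).__name__, price_column.ndim)
print(type(first_row).__name__, list(first_row.index))
print(type(price_frame).__name__, price_frame.shape)
```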


## Question 3	 [10 points]

*Starting from the argument that a Pandas DataFrame represents rectangular data, use the internet and other resources to describe in no more than a few sentences the difference between rectangular and non-rectangular data.*

Rectangular data is organised in a tabular format with rows and columns, where each row represents a unique observation and each column represents a variable. Each cell in a rectangular data structure contains a single value or data point. In contrast, non-rectangular data structures do not have a tabular format and may contain complex or hierarchical data structures. Examples of non-rectangular data structures include JSON files, XML files, and hierarchical data formats such as HDF5. Non-rectangular data structures may require specialised tools and techniques for manipulation and analysis, whereas rectangular data can be easily analysed using tools such as Pandas DataFrames.
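
As a small illustration of the difference, the sketch below starts from nested, JSON-like records (non-rectangular, because each customer has a different number of items) and flattens them into a rectangular table with `pd.json_normalize`. The `orders` data is made up for this example.

```python
import pandas as pd

# Non-rectangular: each record holds a list of items of varying length.
orders = [
    {"customer": "Ana", "items": [{"name": "notebook", "price": 9},
                                  {"name": "eraser", "price": 2}]},
    {"customer": "Ben", "items": [{"name": "pencil case", "price": 5}]},
]

# Rectangular: one row per item, with the customer repeated on each row.
flat = pd.json_normalize(orders, record_path="items", meta=["customer"])
print(flat)
```

After flattening, every row is one observation (an item bought) and every column is one variable, which is the shape a DataFrame expects.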

## Question 4	 [10 points]

*Starting from the data visualisation usage from Session 2, give examples of when figures could be:*

*1. Of use to the data scientist to identify patterns in the data/highlight the important parts of a data set;* 

*2. Tell a story and create business presentations.* 

1. As a data scientist, figures can be of great use to identify patterns in the data or highlight important parts of a data set. For example, a **scatter plot** can be used to identify relationships between two variables, a **histogram** can be used to identify the distribution of a single variable, and a **boxplot** can be used to identify the spread of the data and potential outliers. By creating figures, data scientists can quickly identify trends, outliers, and patterns in the data, which can help inform decision-making and guide further analysis.

2. Data visualisation is also crucial in creating business presentations as it helps tell a story and communicate insights effectively. For instance, a **line chart** can show the trend of a company's sales over time, a **bar chart** can compare revenue across different departments, and a **heat map** can highlight the areas where the company is performing well or needs improvement. By presenting data in a visual format, business presentations can effectively communicate key insights to stakeholders, help guide decision-making, and drive organisational change.

## Question 5	 [50 points]

*Each sub-question is worth 10 points.*
*Using the titanic dataset which you can read into your notebook using the following code,* 

`import pandas as pd`

`titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')`

*answer the following questions:* 

*1. How many columns and rows does the data have?*

In [4]:
# Question 5.1: number of rows and columns
import pandas as pd

# read_csv already returns a DataFrame, so no extra conversion is needed
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
titanic_df = titanic
# shape is a (rows, columns) tuple
rows, columns = titanic_df.shape
print(rows, columns)

891 15



*2. Get a sense of your data and find the min, max, and count/mean depending on the data type.*



In [5]:
result = titanic_df.groupby('sex').agg({'age':['mean','min','max']})

result

| sex | age (mean) | age (min) | age (max) |
|---|---|---|---|
| female | 27.915709 | 0.75 | 63.0 |
| male | 30.726645 | 0.42 | 80.0 |


I calculated the mean, minimum, and maximum age of passengers, grouped by sex, to get an overview of the age range for each sex.
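
The question also asks for a summary that depends on the data type. One possible way to cover every column is `describe()`, which reports count, mean, min, and max for numeric columns and count, unique, top, and freq for non-numeric ones. The sketch below uses a small made-up `sample` DataFrame rather than the Titanic data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample with two numeric columns and one text column.
sample = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1],
    "sex": ["male", "female", "female", "female"],
})

# Numeric columns: keep only the statistics the question asks about.
numeric_summary = sample.describe().loc[["count", "mean", "min", "max"]]

# Non-numeric columns: count, number of unique values, most frequent value.
categorical_summary = sample.select_dtypes(exclude="number").describe()

print(numeric_summary)
print(categorical_summary)
```

Note that `count` excludes missing values, so it is lower for `age` than for `fare` in this sample.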


*3. Give an overview (code and an explanation) of all missing values in the data.* 



In [6]:
[col for col in titanic_df.columns if titanic_df[col].isnull().any()]

['age', 'embarked', 'deck', 'embark_town']

This line of code shows which columns contain NaN values. As we can see, four columns contain missing values: 'age', 'embarked', 'deck', and 'embark_town'.


In [7]:
count = titanic_df.isnull().sum()
count

# This displays the number of missing values in each column.

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [8]:
percentage = round(100*titanic_df.isnull().sum()/len(titanic_df),2)
percentage

# The output of this cell shows the percentage of values that are missing in each column.

survived        0.00
pclass          0.00
sex             0.00
age            19.87
sibsp           0.00
parch           0.00
fare            0.00
embarked        0.22
class           0.00
who             0.00
adult_male      0.00
deck           77.22
embark_town     0.22
alive           0.00
alone           0.00
dtype: float64
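
The count and the percentage can also be combined into a single overview table that lists only the affected columns. This is a minimal sketch on a small made-up `sample` DataFrame, not the Titanic data; the column names `missing` and `percent` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two of its three columns.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.93, 53.1],
})

missing = sample.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (100 * missing / len(sample)).round(2),
})

# Keep only columns with at least one missing value, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```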

*4. Delete the rows where you do not have information about the age of the person. Then group the passengers in a 10 year age range (for example, you can do something like 0 – 10, 11 – 20, 21 – 30, etc).*


In [9]:
# delete the rows
new_df = titanic_df.dropna(subset = ['age'])
new_df

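|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 885 | 0 | 3 | female | 39.0 | 0 | 5 | 29.1250 | Q | Third | woman | False | NaN | Queenstown | no | False |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |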


In [12]:
# group the passengers into 10-year age ranges
bins = [0,10,20,30,40,50,60,70,80]
labels = ['0-10','11-20','21-30','31-40','41-50','51-60','61-70','71-80']
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
new_df = new_df.assign(age_range = pd.cut(new_df['age'], bins = bins, labels = labels))
grouped_df = new_df.groupby(['age_range']).size().reset_index(name = 'count')
grouped_df


|   | age_range | count |
|---|---|---|
| 0 | 0-10 | 64 |
| 1 | 11-20 | 115 |
| 2 | 21-30 | 230 |
| 3 | 31-40 | 155 |
| 4 | 41-50 | 86 |
| 5 | 51-60 | 42 |
| 6 | 61-70 | 17 |
| 7 | 71-80 | 5 |
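
It is worth checking how `pd.cut` treats the bin edges, since that decides which group a passenger aged exactly 10 or 20 falls into. By default each bin includes its right edge and excludes its left edge. The sketch below uses a few hand-picked ages (not taken from the dataset, apart from the minimum 0.42 and maximum 80 reported earlier) with the same bins and labels.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']

# Hand-picked ages that sit on or near the bin edges.
ages = pd.Series([0.42, 10, 10.5, 20, 80])
age_range = pd.cut(ages, bins=bins, labels=labels)
print(list(age_range.astype(str)))

# An age of exactly 0 falls outside the first bin (0, 10] and becomes NaN
# unless include_lowest=True is passed.
zero_default = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)
zero_included = pd.cut(pd.Series([0.0]), bins=bins, labels=labels, include_lowest=True)
print(zero_default.isna().iloc[0], zero_included.iloc[0])
```

Because the youngest passenger in this dataset is 0.42 years old, no rows are lost with the default settings.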


*5. For each age category created in question 4, find out how many passengers are female/male, and how many travelled in each class.*

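In [17]:
# grouped_df1 holds the passenger count for every (age_range, sex) pair
grouped_df1 = new_df.groupby(['age_range','sex']).size().reset_index(name = 'count')
# look up the count for one pair; len() would only count matching rows
female1 = grouped_df1.loc[(grouped_df1["age_range"]=="0-10") &
                          (grouped_df1["sex"]=="female"), "count"].iloc[0]
female1

# female1 is the number of female passengers in the 0 to 10 age group.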

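In [18]:
# grouped_df2 holds the passenger count for every (age_range, class) pair
grouped_df2 = new_df.groupby(['age_range','class']).size().reset_index(name = 'count')
# look up the count for one pair; len() would only count matching rows
First1 = grouped_df2.loc[(grouped_df2["age_range"]=="0-10") &
                         (grouped_df2["class"]=="First"), "count"].iloc[0]
First1

# First1 is the number of passengers aged 0 to 10 who travelled in First class.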
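
Since the question asks about every age category rather than a single one, a cross-tabulation may be a more direct way to present the answer. The sketch below uses `pd.crosstab` on a small made-up `sample` DataFrame with the same column names; the counts are illustrative and are not results from the Titanic data.

```python
import pandas as pd

# Hypothetical passengers, already assigned to age ranges.
sample = pd.DataFrame({
    "age_range": ["0-10", "0-10", "11-20", "11-20", "11-20"],
    "sex": ["female", "male", "female", "female", "male"],
    "class": ["Third", "First", "Second", "Third", "Third"],
})

# One row per age range, one column per sex or class; empty pairs show 0.
by_sex = pd.crosstab(sample["age_range"], sample["sex"])
by_class = pd.crosstab(sample["age_range"], sample["class"])

print(by_sex)
print(by_class)
```

Applied to the real data, the same two calls on `new_df` would give the female/male and per-class counts for all eight age ranges at once.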