# DATA VISUALIZATION

## You have been provided with data from the Titanic passenger roster, indicating whether or not each individual survived the sinking of the ship. 'Survived' is denoted by (1) and 'Did Not Survive' is denoted by (0). Using only Plotly, graph the following information from the data. Feel free to use any of the Plotly submodules discussed (graph_objects, plotly express or iplot). Remember to label your axes and name your plots appropriately. For each given question, only one figure is expected.

### 1. Show the age distribution in the data using a histogram.
### 2. Show the age distribution based on gender using a histogram.
### 3. Using the function df.corr() to identify the correlation within the data, represent its results using a form of a matrix plot.
### 4. Pivot the data setting the column Pclass as the columns and Fare as the values. From the resulting structure, use a boxplot to show the distribution of the values in its 3 columns.
### 5. Graph the value counts of the number of passengers who survived and did not survive based on gender using a stacked bar graph.
### 6. Using a scatter plot, plot the ages against the fare paid by each passenger based on their gender.
### 7. Plot a bubble plot of the ages against the fare paid by each passenger, categorizing whether they survived or not. The size of each bubble should be determined by the passenger class, with the name of each individual as the hover name.

# 1.0 Importing necessary Python Libraries

In [2]:
import pandas as pd
import plotly.express as px

# 1.1 Reading Data From Our CSV File

In [5]:
#creating a dataframe and naming it titanic
titanic=pd.read_csv('Titanic Data.csv')
titanic


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# 1.2 Exploring Our Dataset

In [6]:
#understanding the data
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
#summary statistics for numeric columns
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


# 1.3 Data Cleaning

In [8]:
#Identifying null values
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
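The raw counts above are easier to judge as proportions: Cabin is missing in 687 of 891 rows (about 77%), while Age is missing in 177 (about 20%). A minimal sketch of that calculation follows; the `sample` frame is a hypothetical miniature used for illustration only, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the Titanic frame, for illustration only.
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# isna() marks missing cells as True; mean() turns that into a fraction per column.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `titanic` would rank the columns by how much of each is missing, which makes the case for dropping Cabin explicit.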

In [9]:
#Most values in Cabin are null (687 of 891), so we drop this column
titanic=titanic.drop('Cabin',axis=1)

In [10]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [36]:
#we can now drop the remaining rows with null values (Age and Embarked)
#note: reset_index() keeps the old row labels as a new 'index' column
titanic=titanic.dropna().reset_index()
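Because `reset_index()` is called without arguments, the old row labels are kept as an extra `index` column, which is why `index` shows up in the later outputs. If that column is not wanted, one possible alternative is `reset_index(drop=True)`. The sketch below compares the two on a hypothetical three-row frame; it is not the cell that was actually run.

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only.
sample = pd.DataFrame({"Age": [22.0, None, 26.0], "Fare": [7.25, 71.28, 7.93]})

# Default behaviour: the old row labels survive as a column named 'index'.
with_old_index = sample.dropna().reset_index()

# drop=True discards the old labels and simply renumbers the rows from 0.
clean = sample.dropna().reset_index(drop=True)

print(with_old_index.columns.tolist())
print(clean.columns.tolist())
```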

In [37]:
titanic[titanic.isna().any(axis=1)]

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked


In [38]:
titanic.shape

(712, 12)

In [39]:
titanic[titanic.duplicated()]

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked


# 1.4 Answering Research Questions

## 1.4.1 Show the age distribution in the data using a histogram.

In [40]:
#summary statistics for the age column
titanic['Age'].describe()

count    712.000000
mean      29.642093
std       14.492933
min        0.420000
25%       20.000000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [86]:
#plotting the data
fig=px.histogram(titanic,x="Age",title="Histogram Showing Passengers' Age Distribution")
fig.show()

### The largest group of passengers on the Titanic fell in the 17-32 age bracket; the middle 50% of ages lie between 20 and 38 (the 25th and 75th percentiles above)

## 1.4.2 Show the age distribution based on gender using a histogram.

In [87]:
titanic.groupby(["Age"])["Sex"].count()

Age
0.42     1
0.67     1
0.75     2
0.83     2
0.92     1
        ..
70.00    2
70.50    1
71.00    2
74.00    1
80.00    1
Name: Sex, Length: 88, dtype: int64

In [88]:
#plotting the data
fig = px.histogram(titanic,x="Age",color="Sex",title="Passengers' Age Distribution Based on Gender")
fig.show()

## 1.4.3 Using the function df.corr() to identify the correlation within the data, represent its results using a form of a matrix plot.

In [89]:
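#correlation between the numeric columns of our dataframe
#(selecting numeric columns first avoids an error on text columns such as Name and Sex in recent pandas versions)
df=titanic.select_dtypes("number").corr()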
df

Unnamed: 0,index,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
index,1.0,1.0,0.029526,-0.035609,0.033681,-0.082704,-0.011672,0.009655
PassengerId,1.0,1.0,0.029526,-0.035609,0.033681,-0.082704,-0.011672,0.009655
Survived,0.029526,0.029526,1.0,-0.356462,-0.082446,-0.015523,0.095265,0.2661
Pclass,-0.035609,-0.035609,-0.356462,1.0,-0.365902,0.065187,0.023666,-0.552893
Age,0.033681,0.033681,-0.082446,-0.365902,1.0,-0.307351,-0.187896,0.093143
SibSp,-0.082704,-0.082704,-0.015523,0.065187,-0.307351,1.0,0.383338,0.13986
Parch,-0.011672,-0.011672,0.095265,0.023666,-0.187896,0.383338,1.0,0.206624
Fare,0.009655,0.009655,0.2661,-0.552893,0.093143,0.13986,0.206624,1.0


In [90]:
#plotting this data
fig=px.imshow(df,title="Heatmap for Titanic DataFrame")
fig.show()

## 1.4.4 Pivot the data setting the column Pclass as the columns and Fare as the values. From the resulting structure, use a boxplot to show the distribution of the values in its 3 columns.


In [91]:
titanic.pivot(columns="Pclass",values="Fare")

Pclass,1,2,3
0,,,7.250
1,71.2833,,
2,,,7.925
3,53.1000,,
4,,,8.050
...,...,...,...
707,,,29.125
708,,13.0,
709,30.0000,,
710,30.0000,,


In [92]:
#plotting this data
fig=px.box(titanic,x="Pclass",y="Fare",title="Box Plot Showing the Distribution of Fare by Pclass")
fig.show()

## 1.4.5 Graph the value counts of the number of passengers who survived and did not survive based on gender using a stacked bar graph.

In [57]:
titanic.groupby(["Survived","Sex"])["Sex"].count().unstack()

Survived,female,male
0,64,360
1,195,93


In [85]:
#plotting the data
df=titanic.groupby(["Survived","Sex"])["Sex"].count().unstack()
px.bar(df, barmode='stack', title='Survival Counts by Sex',opacity=0.75)


## 1.4.6 Using a scatter plot, plot the ages against the fare paid by each passenger based on their gender.

In [78]:
fig=px.scatter(titanic,x="Age",y="Fare",color="Sex",title="Ages to Fare Paid based on Gender")
fig.show()

## 1.4.7 Plot a bubble plot of the ages to the fare paid by each passenger categorizing whether they survived or not. The size of each bubble should be determined by the passenger class and the name of each individual as the hover name.

In [93]:
#titanic_data[['Survived']] = titanic_data[['Survived']].astype('float64', copy=False)
#plotting the data
fig=px.scatter(titanic,x="Fare",y="Age",color="Survived",size="Pclass",hover_name="Name",log_x=True,title="Bubble Plot of Age Against Fare Paid")
fig.show()