# Day Two: Understanding and Visualising Data
When you first start to analyse a dataset, you will not always know all the details about how it was collected and why it has been formatted in the way it is.

To understand these aspects, you can start with some basic statistical analysis. You will also begin to visualise the data, which is a useful way to get an idea of how to interpret it.

In [None]:
#First we import the pandas library, which we use to load and analyse the data
#We also import pyplot from the matplotlib library, which lets us visualise the data
import pandas as pd 
from matplotlib import pyplot as plt


# Uncomment the line below if you want to load the file from your own machine, without an internet connection
#data = pd.read_csv('Data/titanic.csv')

#This one can only be run with an internet connection 
data = pd.read_csv('https://raw.githubusercontent.com/chroadhouse/Futureme/main/Data/titanic.csv')

#Run this code to check that the data has been read in correctly - head() shows the first five rows
data.head()

# Creating our own table: Running metrics on data
Even though you can run the .describe() method, which returns a table of common summary statistics, it is still useful to be able to calculate these ourselves.

Did you see above where we passed the link to our dataset to pd.read_csv() inside brackets?

We then **assigned** it to a variable called *'data'* using **=**. This **stores** the dataset for us (have another look at yesterday's notebook if you need a recap on variables).

We can then single out one column of this dataset by calling this variable and inserting the name of the column we want inside square brackets **[ ]**.
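
As a minimal sketch of this idea, the example below uses a tiny made-up table (the names `toy`, `one_column` and `two_columns` are purely illustrative and are not part of the Titanic notebook). One column name inside the square brackets gives back a single column, while a list of names gives back a smaller table.

```python
import pandas as pd

# Made-up table for illustration only (not the Titanic data)
toy = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Age': [29, 41]})

# One column name in square brackets returns a single column (a pandas Series)
one_column = toy['Age']

# A list of column names returns a smaller table (a pandas DataFrame)
two_columns = toy[['Name', 'Age']]

print(type(one_column))
print(type(two_columns))
```

A Series is what methods such as `.mean()` and `.max()` are called on in the cells that follow.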

In [None]:
#data['Age'] returns only the values in the Age column
age_mean = data['Age'].mean()
age_mode = data['Age'].mode()
age_median = data['Age'].median()
age_max = data['Age'].max()
age_min = data['Age'].min()
age_stand = data['Age'].std()

#Here we create a dictionary - the *keys* are the titles of columns and the *values* are the variables we created above
age_table = pd.DataFrame({
    'Mean':age_mean,
    'Mode':age_mode,
    'Median':age_median,
    'Maximum':age_max,
    'Minimum':age_min,
    'Standard Deviation':age_stand
})

age_table
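
The mode behaves slightly differently from the other statistics: `.mode()` returns a Series rather than a single number, because a column can have more than one most common value. If that happens, the table above would have one row per mode. The sketch below uses a short made-up list of ages (not the Titanic data) to show this, alongside the built-in `.describe()` summary mentioned earlier.

```python
import pandas as pd

# Made-up ages for illustration only (not taken from the Titanic dataset)
toy_ages = pd.Series([22, 22, 30, 30, 41])

# Two values are tied for most common, so mode() returns both of them
toy_modes = toy_ages.mode()
print(toy_modes.tolist())

# describe() returns the count, mean, std, min, quartiles and max in one call
toy_summary = toy_ages.describe()
print(toy_summary)
```

Note that `.describe()` does not report the mode for numeric columns, which is one reason to build the table by hand.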

In [None]:
#We can also do this with another quantitative column, such as the fare
fare_mean = data['Fare'].mean()
fare_mode = data['Fare'].mode()
fare_median = data['Fare'].median()
fare_max = data['Fare'].max()
fare_min = data['Fare'].min()
fare_stand = data['Fare'].std()


fare_table = pd.DataFrame({
    'Mean':fare_mean,
    'Mode':fare_mode,
    'Median':fare_median,
    'Maximum':fare_max,
    'Minimum':fare_min,
    'Standard Deviation':fare_stand
})

fare_table
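
The Age and Fare cells repeat the same steps, so one possible tidy-up is to wrap them in a function. The sketch below is only a suggestion: `summarise` is a hypothetical helper name, and the fares are made up for illustration. It keeps only the first mode when there is a tie, which is a simplification.

```python
import pandas as pd

def summarise(column):
    """Return a one-row table of summary statistics for a numeric column."""
    return pd.DataFrame({
        'Mean': [column.mean()],
        'Mode': [column.mode().iloc[0]],  # keeps only the first mode if several are tied
        'Median': [column.median()],
        'Maximum': [column.max()],
        'Minimum': [column.min()],
        'Standard Deviation': [column.std()],
    })

# Made-up fares for illustration only (not the Titanic data)
toy_fares = pd.Series([7.25, 8.05, 8.05, 26.0, 71.28])
toy_table = summarise(toy_fares)
print(toy_table)
```

With the Titanic data loaded, the same function could be called as `summarise(data['Age'])` or `summarise(data['Fare'])`.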

# Checking the Frequency of Categorical data
Data is more than just numbers!

Categorical data is information that is divided into groups (e.g. sex, age group, education level), and it can also provide vital insight.

One of the best things you can do when looking at categorical data is check the **frequency** of each value.

In [None]:
#The value_counts() method returns the frequency of each category
embark_data = data['Embarked'].value_counts()

embark_data

In [None]:
#Here we replace the characters 'S', 'C' and 'Q' with the full names of each location, enclosed in quotes so they are 'strings'.
#This will make the dataset much easier to read.
data['Embarked'] = data['Embarked'].replace({'S':'Southampton', 'C':'Cherbourg', 'Q':'Queenstown'})

#Here we store the count and percentage to show in a table
c = data.Embarked.value_counts()
p = data.Embarked.value_counts(normalize=True).mul(100).round(1).astype(str)
value_count = pd.concat([c,p], axis=1, keys=['counts', '%'])

value_count
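
One detail worth knowing is that `value_counts()` leaves out missing values unless you ask for them with `dropna=False`. The sketch below shows the difference on a small made-up column of ports (illustrative data only, with one missing entry).

```python
import pandas as pd

# Made-up ports for illustration only; None stands for a missing value
toy_ports = pd.Series(['Southampton', 'Cherbourg', 'Southampton',
                       None, 'Queenstown', 'Southampton'])

# By default, missing values are not counted
counts_without_missing = toy_ports.value_counts()

# dropna=False adds a row for the missing values
counts_with_missing = toy_ports.value_counts(dropna=False)

# normalize=True gives proportions, which we turn into percentages
shares = toy_ports.value_counts(normalize=True).mul(100).round(1)

print(counts_without_missing)
print(counts_with_missing)
print(shares)
```

Because the missing entry is dropped by default, the percentages are shares of the known values only, not of all rows.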

# Plotting Data 
Plotting data is one of the best ways to detect patterns and relationships.

To plot data in Python, we use the **matplotlib** library, which makes it much easier! When importing its pyplot module, it is common practice to give it the name **plt**.

To find out more, you can also read the documentation for this module: https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html


# Histograms
Histograms look quite similar to bar charts, but they show numerical data. The horizontal axis is split into ranges (often called bins), and the height of each bar shows how many values fall into that range.

In [None]:
#We create a histogram to show the age of passengers on the Titanic
plt.hist(data['Age'])
plt.xlabel('Age of passenger')
plt.ylabel('Number of passengers')
plt.title('Histogram of Passenger Age')
plt.show()
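
To see what a histogram is counting, the sketch below groups a few made-up ages into ranges by hand using `pd.cut` and then counts each range. This only illustrates the idea: `plt.hist` chooses its own ranges (10 equal-width bins unless you pass a different number through its `bins` argument) and treats the bin edges slightly differently.

```python
import pandas as pd

# Made-up ages for illustration only (not the Titanic data)
toy_ages = pd.Series([2, 15, 22, 24, 29, 35, 38, 54, 61])

# Split the ages into three equal-width ranges: 0-25, 25-50 and 50-75
age_ranges = pd.cut(toy_ages, bins=[0, 25, 50, 75])

# Count how many ages fall into each range; sort_index keeps the ranges in order
range_counts = age_ranges.value_counts().sort_index()
print(range_counts)
```

Each count corresponds to the height of one bar. Trying `plt.hist(data['Age'], bins=20)` on the Titanic data is a quick way to see how the number of bins changes the shape of the chart.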

In [None]:
#Here we create two histograms of the fare paid: one for passengers who didn't survive and one for those who did.
#The filters below assume the 'Survived' column holds the text labels 'Not Survive' and 'Survived'.
not_survived_fare = data['Fare'][data['Survived']=='Not Survive']
survived_fare = data['Fare'][data['Survived']=='Survived']
plt.figure(figsize=(12,6))
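#Left-hand plot: fares paid by passengers who didn't survive
plt.subplot(121)
not_survived_fare.plot(kind='hist', title="People who didn't survive")
plt.xlabel('Fare')

#Right-hand plot: fares paid by passengers who survived
plt.subplot(122)
survived_fare.plot(kind='hist', title='People who survived')
plt.xlabel('Fare')
plt.show()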


# Pie Charts
A pie chart is divided up into slices to illustrate the numerical value of each category. It is best used for showing percentages.


In [None]:
#The below pie chart shows the percentage of males and females on the ship.
plt.figure(figsize=(10,10))
data.Sex.value_counts().plot(kind='pie', title='Pie chart to show the percentage of males and females', autopct='%d%%')
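
The `autopct='%d%%'` argument is a format string that matplotlib fills in with each slice's percentage. The sketch below applies the same kind of formatting to a made-up percentage to show that `%d` drops the decimals rather than rounding, and that `%.1f` is an alternative that keeps one decimal place.

```python
# Made-up percentage for illustration only
share = 64.76

# '%d' keeps only the whole-number part, so the decimals are dropped (not rounded)
whole_label = '%d%%' % share

# '%.1f' rounds to one decimal place instead
decimal_label = '%.1f%%' % share

print(whole_label)    # 64%
print(decimal_label)  # 64.8%
```

Passing `autopct='%.1f%%'` to the pie chart would therefore give slightly more precise labels.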

In [None]:
#Creating two pie charts to show the people that did and didn't survive, based on their passenger class (Pclass).
not_survived_class = data['Pclass'][data['Survived']=='Not Survive']
survived_class = data['Pclass'][data['Survived']=='Survived']

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
not_survived_class.value_counts().plot(kind='pie', title="People who didn't survive", autopct='%d%%')

plt.subplot(1,2,2)
survived_class.value_counts().plot(kind='pie', title='People who survived', autopct='%d%%')

# Bar Charts
Bar charts represent categorical data. Each bar stands for one category, and its height lets us compare a numerical value, such as a count or a percentage, across the categories.

In [None]:
#The groupby method groups the rows by the column you pass in (here 'Sex'); value_counts() then counts each survival outcome within each group
plt.figure(figsize=(10,10))
data.groupby('Sex').Survived.value_counts().plot(kind='bar')
plt.title('Bar Chart to show the number of males and females that survived and did not survive')
plt.xlabel('Sex and survival outcome')
plt.ylabel('Number of passengers')
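
It can help to look at the numbers behind a grouped chart before plotting them. The sketch below uses a small made-up table (the labels only mimic the ones filtered on earlier in this notebook) and adds `.unstack()`, which turns the survival outcomes into columns so the result reads like an ordinary table.

```python
import pandas as pd

# Made-up passengers for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Survived': ['Not Survive', 'Not Survive', 'Survived',
                 'Survived', 'Survived', 'Not Survive'],
})

# Count each outcome within each sex, then move the outcomes into columns
outcome_table = toy.groupby('Sex')['Survived'].value_counts().unstack(fill_value=0)
print(outcome_table)
```

Calling `outcome_table.plot(kind='bar')` on a table shaped like this draws one group of bars per sex, with one bar for each outcome.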

## Bar Charts: Understanding our data
Datasets are rarely perfect. More often than not, there will be all kinds of information that seemingly disappears, leaving you with empty spaces in your spreadsheets and tables.

Therefore, you should always check which values are **missing** when working with data.

The term **Null** refers to a value that is missing or unknown, so a non-null count only includes the values that are actually present.

The table below shows us how much data we have for each column in the non-null count section.

Can you see anything wrong with the 'Cabin' column?


In [None]:
#Shows us how much data we have for each column in the 'Non-Null Count' column of the output
data.info()

It appears we have quite a bit of missing data from the 'Cabin' column.

When working with data in pandas, missing values are shown as **NaN**, which stands for **Not a Number**. Despite the name, it is used to mark blanks in any kind of column, including text columns such as 'Cabin'.

**.isna()** is a method that identifies missing data: it returns **True** wherever a value is NaN and **False** everywhere else.

In [None]:
#The isna() method will return whether or not the data is NaN
data.isna()

In [None]:
#Running the sum method on this result gives us the number of NaNs in each column (True counts as 1 and False as 0)
data.isna().sum()
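
Raw counts are easier to judge as a share of the whole column. As a small sketch on made-up data (not the Titanic file), taking the mean of `isna()` instead of the sum gives the fraction of missing values, which can then be turned into a percentage.

```python
import pandas as pd

# Made-up table for illustration only; None stands for a missing value
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0, 41.0],
    'Cabin': [None, None, 'C85', None],
})

# Number of missing values in each column
missing_count = toy.isna().sum()

# Percentage of missing values in each column
missing_percent = toy.isna().mean().mul(100)

print(missing_count)
print(missing_percent)
```

The same two lines can be run on `data` to see what percentage of each Titanic column is missing.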

In [None]:
#The tilde ( ~ ) inverts the result of isna(): True now marks a value that is present
fullData = ~data.isna()

#Here we create the bar chart: sum() counts the values that are present in each column
plt.figure(figsize=(10,10))
fullData.sum().plot(kind='bar')

#Customising the chart we have just created
plt.xlabel('Columns')
plt.ylabel('Number of non-missing values')
plt.title('Bar chart to show the number of non-missing values in each column')
plt.show()

#This bar chart shows how many values are present in each column. We can also see this using .info(), but visualising the information can be helpful too.
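
pandas also offers more direct ways to count the values that are present. The sketch below, on a small made-up table, shows that inverting `isna()` with the tilde, calling `notna()`, and calling `count()` all give the same numbers.

```python
import pandas as pd

# Made-up table for illustration only; None stands for a missing value
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0],
    'Cabin': [None, None, 'C85'],
})

# Three ways to count the values that are present in each column
present_inverted = (~toy.isna()).sum()
present_notna = toy.notna().sum()
present_count = toy.count()

print(present_inverted)
print(present_notna)
print(present_count)
```

Which one to use is mostly a matter of readability; `count()` is the shortest if all you need is the totals.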