In [1]:
import pandas as pd

# Principles of Exploratory Data Analysis

<center><img src="https://media.tenor.com/images/db6cfa232c0fd929134c96c556d2ae3b/tenor.gif" width="400" height="400" /></center>

**<span style='color:blue'>Exploratory Data Analysis (EDA)</span>** is an analysis approach that identifies general patterns in the data by summarizing the main characteristics and using data visualization methods.

Like much of what we do, there is no single set way of performing EDA, but there are some general principles and common practices we can follow, along with pandas functions that help us out.

### Learn the Background of the Data

Data does not appear out of a vacuum. Who collected this data? Why was it collected? Under what circumstances? Is the data reliable? Could it be biased? What does each record (row) represent? These are all valuable questions that provide context for the data you will be working with.

### Read in the Data

You have to get hold of the data to explore it! With pandas, you will usually read the data in from a CSV file. The function `pd.read_csv('path_to_csv_location.csv')` reads a CSV file and returns its contents as a dataframe, which you can assign to a variable:

In [2]:
titanic_df = pd.read_csv('EDApt2Exercises-main/data/titanic.csv')
titanic_df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


**<span style='color:red'>Warning</span>** : the path to the CSV can be relative, in which case it is typically resolved from the notebook's working directory (usually the folder the notebook is in). If you move the notebook, a relative path can stop pointing at the file, so in that case you may want to use the full (absolute) file path.
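
One way to avoid depending on where the notebook is launched from is to build an absolute path with the standard-library `pathlib` module. The sketch below is self-contained, so it first writes a tiny stand-in CSV to a temporary folder. The folder, the file name `example.csv`, and the two sample rows are placeholders, not the real Titanic file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for a data folder; in practice this would be your own directory.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "example.csv"
csv_path.write_text("PassengerId,Survived\n1,0\n2,1\n")

# resolve() turns the path into an absolute one, so it no longer depends on
# the current working directory.
absolute_path = csv_path.resolve()
example_df = pd.read_csv(absolute_path)
print(absolute_path.is_absolute(), example_df.shape)
```

`pd.read_csv` accepts a `Path` object as well as a string, so there is no need to convert the path first.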

### Determine How Much Data You Have

How many rows and columns are there in your data? Knowing this helps put other statistics in perspective, depending on whether you have a large or a relatively small dataset. A helpful pandas attribute for this is `dataframe_name.shape`, which gives you the number of rows and columns:

In [7]:
titanic_df.shape

(891, 12)

So we can see that we have 891 rows and 12 columns.
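
Because `.shape` is a plain tuple, it can be unpacked into two variables for later use. A minimal sketch on a small made-up dataframe (the name `toy_df` and its values are illustrative only):

```python
import pandas as pd

# Made-up three-row dataframe, not the real Titanic data
toy_df = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]})

# shape is (number_of_rows, number_of_columns)
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```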

### Determine What the Columns Are, Their Data Types, and What They Mean

The columns in a dataset indicate what information is available for each of the records. Sometimes the column names are not very clear, and you will likely need to consult the documentation for the data to understand what the names mean and any nuances involved. You will also need to figure out the data type of the values contained in each column, as well as whether those values are categorical or numerical, discrete or continuous.

**<span style='color:orange'>Note</span>** : just because there are numbers in a column doesn't mean it is numerical. Numbers are sometimes used as labels for categories.
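
As an illustration, `Pclass` stores the ticket class as the numbers 1, 2, and 3, but these are labels rather than quantities. One possible way to make that explicit is to convert the column to the pandas `category` dtype. The sketch below uses a small made-up dataframe (`toy_df` is a hypothetical name), not the full dataset.

```python
import pandas as pd

# Made-up sample of ticket classes
toy_df = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# pandas will happily average the numbers, but an "average class" has little meaning
print(toy_df["Pclass"].mean())

# Converting to the category dtype marks the values as labels
toy_df["Pclass"] = toy_df["Pclass"].astype("category")
print(toy_df["Pclass"].dtype)
print(list(toy_df["Pclass"].cat.categories))
```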

This Titanic data is a classic dataset from Kaggle. Information about the columns and their values can be found here: https://www.kaggle.com/competitions/titanic/data

To list the columns in the dataframe you can use `dataframe_name.columns`:

In [8]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

To view the data types of these columns you can use `dataframe_name.dtypes`:

In [13]:
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

**<span style='color:blue'>Note</span>** : columns of strings are given the dtype `object`, but you should check the actual values, since `object` does not always mean string.
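
One way to check what an `object` column really holds is to look at the Python type of each value. The sketch below uses a made-up series (`mixed` is a hypothetical name) that deliberately mixes strings, a number, and a missing value.

```python
import pandas as pd

# Hypothetical column mixing strings, an integer, and a missing value
mixed = pd.Series(["A/5 21171", 113803, None, "PC 17599"], dtype=object)
print(mixed.dtype)

# Count how many values there are of each Python type
type_names = mixed.map(lambda value: type(value).__name__).value_counts()
print(type_names)
```

If every value reports `str`, the column can be treated as text. Anything else is a hint that the column needs a closer look before analysis.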

You can also display both of these pieces of information, along with the number of non-null values in each column, with `dataframe_name.info()`:

In [9]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Understand Your Categorical Data

For columns that are categorical, you will want to determine what the categories mean and what the possible categories are, especially if those categories are abstracted in some way (e.g. each category turned into a number). 

To see the number of unique values in each column you can use `dataframe_name.nunique()` (this function can also help you determine which columns are categorical):

In [16]:
titanic_df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
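
To turn those counts into a shortlist of columns worth inspecting as categorical, you could filter on a cutoff. The sketch below copies the counts from the output above into a series so that it runs on its own. The cutoff of 10 (`max_categories`) is an arbitrary assumption, not a rule.

```python
import pandas as pd

# Unique-value counts copied from the nunique() output above
unique_counts = pd.Series({
    "PassengerId": 891, "Survived": 2, "Pclass": 3, "Name": 891,
    "Sex": 2, "Age": 88, "SibSp": 7, "Parch": 7,
    "Ticket": 681, "Fare": 248, "Cabin": 147, "Embarked": 3,
})

max_categories = 10  # hypothetical cutoff; adjust for your data

# Keep only the columns with few distinct values
candidates = unique_counts[unique_counts <= max_categories].index.tolist()
print(candidates)
```

A low count is only a hint. `SibSp` and `Parch` make the shortlist, but they are counts of relatives, so they are discrete numerical values rather than categories.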

To actually see each of the values that show up in a column you can use `dataframe_name['column name'].unique()`. 

In [21]:
titanic_df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

To see how many of each category are in the column you can use `dataframe_name['column name'].value_counts()`:

In [22]:
titanic_df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64
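
Notice that the counts above add up to 889, not 891, because `value_counts()` leaves out missing values by default. The sketch below, on a small made-up series, shows two optional parameters that may be useful here: `dropna=False` to count missing values too, and `normalize=True` to get proportions instead of counts.

```python
import pandas as pd

# Small made-up sample of port codes, including one missing value
embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

# Include the missing value as its own row in the counts
with_missing = embarked.value_counts(dropna=False)
print(with_missing)

# Share of each category among the non-missing values
proportions = embarked.value_counts(normalize=True)
print(proportions)
```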

In [8]:
# Which of the columns above look categorical? Pick one and look at its unique values and how many times each shows up
titanic_df['Survived'].unique()

array([0, 1], dtype=int64)

In [9]:
titanic_df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

### Look at How Sparse the Data Is

**<span style='color:red'>NOTE FOR NEXT TIME TEACHING</span>** : https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d and https://www.justintodata.com/data-cleaning-techniques-python-guide/ from the data cleaning section include good methods for visualizing how much data is missing

**<span style='color:blue'>Sparseness</span>** refers to how much of the data is missing from the dataset. Datasets or individual columns that are very sparse will have a lot of data missing, which can be problematic for analysis. 

In programming, missing values are called **Null** values; in pandas they usually show up as `NaN`. To check how many values are missing from each column you can use `dataframe_name.isnull().sum()`:

In [5]:
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
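
Raw counts are easier to judge as a share of all rows. The sketch below reuses the three non-zero counts from the output above and the 891 rows reported by `.shape`, so it runs without the CSV file. The variable names are illustrative.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of rows, from titanic_df.shape

# Percentage of rows with a missing value in each column
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)

# On the full dataframe the same percentages can be computed directly with:
#   titanic_df.isnull().mean() * 100
```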

*In what situation would these numbers be problematic?*

How you handle Null values will depend on the situation. The values could be missing randomly or systematically (i.e., for a reason). When Null values are not random, they may provide insights or show valuable trends.

Some ways you may end up handling Null values:

1. Determine that the data is too sparse to provide meaningful results.

1. Leave them and just account for them in analysis.

1. Delete all of the rows with missing values (*when would this be a good approach?*)

1. Fill in the missing values with:

    - results from regression
    
    - the average value for the column
    
    - the most common value for that column
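
A few of the options above can be sketched with `dropna()` and `fillna()`. The example uses a small made-up dataframe (`toy_df` is a hypothetical name) rather than the Titanic data, and it leaves out the regression option, which needs a fitted model. Which option is appropriate still depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Option: delete every row that has at least one missing value
dropped = toy_df.dropna()

# Option: fill a numerical column with its average
age_filled = toy_df["Age"].fillna(toy_df["Age"].mean())

# Option: fill a categorical column with its most common value
most_common = toy_df["Embarked"].mode()[0]
embarked_filled = toy_df["Embarked"].fillna(most_common)

print(len(dropped), age_filled.isnull().sum(), embarked_filled.isnull().sum())
```

None of these calls changes `toy_df` itself. Each one returns a new object, so the original data stays available for comparison.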

### Visualize the Data 

Coming up in a class soon!

### Check In

<center><img src="https://media.tenor.com/0eCH1vgDVBYAAAAM/its-always-good-to-check-in-dwayne-johnson.gif" width="400" height="400" /></center>

Once you understand what the columns mean, the nuances of the data they contain, and roughly how sparse the data is, it is a great time to check in with your analysis goals before going further. Is this really the data that you need to answer the questions the analysis is supposed to answer? If not, you should go back and get more or different data instead of spending too much time on the current dataset.