<a href="https://colab.research.google.com/github/AshleyBrooks213/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/AshleyBrooks_DS21_Exploratory_Data_Analysis_(DS21)_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis

## Objectives

- Save your assignments to GitHub
- Load a dataset (CSV) from a URL using pandas.read_csv()
- Load a dataset (CSV) from a local file using pandas.read_csv()
- Use basic Pandas functions for Exploratory Data Analysis (EDA)
- Describe and discriminate between basic data types such as categorical, quantitative, continuous, discrete, ordinal, nominal and identifier

## Intro to Git & GitHub

> **Distributed** non-linear version control system

[https://en.wikipedia.org/wiki/Git](https://en.wikipedia.org/wiki/Git)


- Keeps track of file changes
- Coordinates work in development teams
- Contrast with Google Docs-style (linear) collaboration on one centralised file

### Git vs GitHub

- Git is the version-control software that we use locally
- GitHub is the central place (server) where team members collaborate and push (up) their code.
- Alternatives:
    - GitLab
    - BitBucket
    - SourceForge

### Git Keywords Explained

- `tree`:
    - The sum of the project's files and folders
- `master`:
    - Our main/initial branch
- `HEAD`:
    - The latest commit in the current branch
    - The **tip** of the current branch
- `remote`:
    - The central repository that team members use to exchange their changes
    - Think remote/far away (on the internet)
- `push`:
    - Think "push up" or "launch" our changes to our remote repo on the internet
- `pull`:
    - Think "pull down" the latest changes, or getting in sync with our remote repo
- `clone`:
    - Copy the online repository to our local machine

## Loading Datasets

To practice loading datasets into Google Colab, we're going to use the [Flags Dataset](https://archive.ics.uci.edu/ml/datasets/Flags) from UCI to show loading a dataset via its URL. The local-file example further down uses `adult.data`.

Steps for loading a dataset:

1) Learn as much as you can about the dataset:
 - Number of rows
 - Number of columns
 - Column headers (Is there a "data dictionary"?)
 - Is there missing data?
 - **OPEN THE RAW FILE AND LOOK AT IT. IT MAY NOT BE FORMATTED IN THE WAY THAT YOU EXPECT.**

2) Try loading the dataset using `pandas.read_csv()` and if things aren't acting the way that you expect, investigate until you can get it loading correctly.

3) Keep in mind that functions like `pandas.read_csv()` have a lot of optional parameters that might help us change the way that data is read in. If you get stuck, google, read the documentation, and try things out.

4) You might need to type out column headers by hand if they are not provided in a neat format in the original dataset. It can be a drag.
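
The steps above can be sketched with a tiny in-memory file. The sample rows and column names below are made up for illustration (they are not the Flags data). `header`, `names`, `skipinitialspace`, and `na_values` are existing `pandas.read_csv()` parameters; whether a real file needs them is something to check by opening the raw file.

```python
import io

import pandas as pd

# Hypothetical raw file: no header row, ", " separators, "?" marks missing data
raw_text = "Alice, 34, ?\nBob, 29, Paris\nCara, 41, Lima\n"

# Naive load: the first data row is silently used as the header
naive = pd.read_csv(io.StringIO(raw_text))

# More careful load: supply the headers and describe the file's quirks
fixed = pd.read_csv(
    io.StringIO(raw_text),
    header=None,                    # the file has no header row
    names=["name", "age", "city"],  # hypothetical column names
    skipinitialspace=True,          # drop the space after each comma
    na_values="?",                  # treat "?" as missing
)

print(naive.shape)           # one record was consumed as the header
print(fixed.shape)           # all records kept
print(fixed.isnull().sum())  # missing values per column
```

Comparing the two shapes is a quick way to notice that a record was lost to the header.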

#### Learning about the dataset and looking at the raw file.

In [None]:
# Find the actual file to download
# From navigating the page, clicking "Data Folder"
# Right click on the link to the dataset and say "Copy Link Address"

# You can "shell out" in a notebook for more powerful tools
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

# Funny extension, but on inspection looks like a csv


# Extensions are just a norm! You have to inspect to be sure what something is

### Loading the Dataset Via Its URL

In [None]:
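import pandas as pd

# Load the flags dataset from its URL
flag_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'

# The raw file has no header row, so this first attempt
# uses the first record as the column names
df = pd.read_csv(flag_data_url)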

#### Fixing our column headers

In [None]:
# If you really mess things up you can always just restart your runtime

column_headers = ['name', 'landmass', 'zone', 'area', 'population', 'language',
                  'religion', 'bars', 'stripes', 'colours', 'red', 'green',
                  'blue', 'gold', 'white', 'black', 'orange', 'mainhue',
                  'circles', 'crosses', 'saltires', 'quarters', 'sunstars',
                  'crescent', 'triangle', 'icon', 'animate', 'text', 'topleft',
                  'botright']

# Reload with the headers supplied so the first record stays in the data
df = pd.read_csv(flag_data_url, names=column_headers)
df.head()

### Loading Datasets From a Local File

#### Method 1: Google Colab File Upload Package
- What should we google to try and figure this out?

In [None]:
from google.colab import files
uploaded = files.upload()

Saving adult.data to adult.data


#### Method 2: Using the Colab GUI (Graphical User Interface)

#### Let's fix the column headers

In [None]:
column_headers = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                 'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                 'capital-gain',  'capital-loss', 'hours-per-week', 
                 'native-country', 'income']
df = pd.read_csv('adult.data', names=column_headers)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Exploratory Data Analysis Using Pandas

> Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations

Exploratory Data Analysis is often the first thing that we'll do when starting out with a new dataset. The discoveries that we make during this stage drive how we treat our data, the models we choose, and in large part the rest of our data science methodology and next steps.

What can we discover about this dataset?

- `df.shape` - returns the number of rows and columns as a `(rows, columns)` tuple (dimensionality)
- `df.head()` - returns the first n rows (5 by default)
- `df.tail()` - returns the last n rows (5 by default)
- `df.dtypes` - shows the data type of each column
    - `int64` = 64-bit integers (whole numbers)
    - `object` = usually a column with strings in it
    - We'll talk a lot more about this tomorrow
- `df.describe()` - shows the summary statistics
    - "Column", "variable", and "feature" are synonymous
    - Summarizes the numeric columns by default
    - `exclude='number'` will give us the summary stats of the non-numeric columns
- `df['column'].value_counts()` - returns how many times each unique value appears in a column
    - We can access an individual column with `df['column']`
    - `value_counts` counts up all of the different values that we have
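
The functions listed above can be tried on a very small table before running them on a real dataset. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration only.

```python
import pandas as pd

# Small made-up dataset, for illustration only
toy = pd.DataFrame({
    "age": [25, 32, 47, 32],
    "job": ["clerk", "nurse", "clerk", "clerk"],
})

print(toy.shape)                       # (rows, columns)
print(toy.head(2))                     # first 2 rows
print(toy.tail(2))                     # last 2 rows
print(toy.dtypes)                      # one dtype per column
print(toy.describe())                  # numeric columns only
print(toy.describe(exclude="number"))  # non-numeric columns only

# Count how often each value appears, most frequent first
job_counts = toy["job"].value_counts()
print(job_counts)
```

On a table this small the results can be verified by eye, which helps when learning what each function reports.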

In [None]:
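# Determine the dimensions of the dataset
# (only the last expression in a cell is displayed; use print(df.shape) to see it)
df.shape

# Look at the last five rows
df.tail()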


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [None]:
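import numpy as np

# Missing values are stored as ' ?' (note the leading space); convert them to NaN
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = df.replace({' ?': np.nan})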
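
The leading space in `' ?'` matters because `replace` only swaps values that match exactly. A minimal sketch of that behaviour, using a made-up one-column table that mimics strings read from a file with `", "` separators:

```python
import numpy as np
import pandas as pd

# Made-up values that mimic strings read from a ", "-separated file
sample = pd.DataFrame({"workclass": [" Private", " ?", " State-gov"]})

# "?" without the leading space matches nothing
no_match = sample.replace({"?": np.nan})

# " ?" matches the stored string exactly
cleaned = sample.replace({" ?": np.nan})

print(no_match["workclass"].isnull().sum())
print(cleaned["workclass"].isnull().sum())
```

If a replacement seems to do nothing, printing a value with `repr()` is a simple way to reveal hidden whitespace.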


In [None]:
#Determine the data types
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [None]:
# Summary Statistics - we'll talk more about what these mean later
# numeric columns by default (integers)
df.describe()



Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [None]:
# non-numeric columns
df.describe(exclude='number')

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income
count,32561,32561,32561,32561,32561,32561,32561,32561,32561
unique,9,16,7,15,6,5,2,42,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,22696,10501,14976,4140,13193,27816,21790,29170,24720


In [None]:
# access a specific column of the dataframe
df['relationship']

0         Not-in-family
1               Husband
2         Not-in-family
3               Husband
4                  Wife
              ...      
32556              Wife
32557           Husband
32558         Unmarried
32559         Own-child
32560              Wife
Name: relationship, Length: 32561, dtype: object

In [None]:
# How many individuals are in each group?
# Only the last expression in a cell is displayed, so the output below is for 'age'
df['marital-status'].value_counts()
df['age'].value_counts()



36    898
31    888
34    886
23    877
35    876
     ... 
83      6
85      3
88      3
87      1
86      1
Name: age, Length: 73, dtype: int64

In [None]:
# check for missing values
# the number of missing values in each column
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64
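
Counts of missing values are easier to judge next to the share of rows they represent. The sketch below uses a made-up table. `isnull().mean()` works because `True` counts as 1 and `False` as 0.

```python
import numpy as np
import pandas as pd

# Made-up table with two missing occupations, for illustration only
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "occupation": ["Sales", np.nan, "Tech-support", np.nan],
})

missing_counts = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```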

In [None]:
# Drop a row by its index label
# (returns a new DataFrame; df itself is unchanged unless we reassign it)
df.drop(0)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [None]:
# axis=1 to look through column headers and not row index
# Drop a column by name (returns a new DataFrame; df itself is unchanged)
df.drop('income', axis=1)




Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States
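
Both `drop` calls above return a new DataFrame and leave the original alone. A minimal sketch on a made-up two-row table shows this; `columns=` is an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Made-up two-row table, for illustration only
toy = pd.DataFrame({"age": [39, 50], "income": ["<=50K", ">50K"]})

without_first_row = toy.drop(0)              # drop by row label
without_income = toy.drop("income", axis=1)  # drop by column label
same_thing = toy.drop(columns="income")      # equivalent to axis=1

print(toy.shape)                # the original is unchanged
print(without_first_row.shape)
print(without_income.shape)
```

To keep a change, assign the result back, for example `toy = toy.drop(0)`.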


## Intro to Variable Types

First, data can most easily be classified as categorical or quantitative.

- Categorical data places each observation into one and only one category: hair color, eye color, favorite flavor of ice cream, letter grade in a class, zip code

- Quantitative data measures something: height, weight, income, number of children

Categorical data can further be classified as ordinal, nominal or an identifier variable.
- Nominal data has no natural ordering: hair color, eye color
- Ordinal data has a natural ordering: letter grades - A, B, C, D, F
- Identifier variables identify each record uniquely and are not analyzed

Quantitative data can further be classified as discrete or continuous.
- Discrete data can be counted in a finite amount of time: Number of individuals riding on a bus
- Continuous data can be measured ever more precisely: My age is 38.134283948577 years old.
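
The difference between nominal and ordinal data can also be expressed in pandas. This is a sketch under the assumption that an ordered `pd.Categorical` is an acceptable way to represent ordinal data; the grades and colours are made-up examples.

```python
import pandas as pd

grades = pd.Series(["B", "A", "F", "C", "A"])

# Ordinal: declare the order explicitly, from lowest to highest
ordinal = pd.Categorical(
    grades, categories=["F", "D", "C", "B", "A"], ordered=True
)

# Nominal: categories without any ordering
nominal = pd.Categorical(["brown", "blue", "brown"])

print(ordinal.min(), ordinal.max())  # only meaningful for ordered categories
print(nominal.ordered)
```

Without `ordered=True`, pandas has no way to know that an A outranks a B, so order-based operations such as `min()` are not available.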

#### Let's import the Titanic.csv dataset and identify the different variable types:

In [None]:


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Titanic.csv'
df2 = pd.read_csv(data_url)
print(df2.shape)
df2.describe()

(887, 8)


Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses_Aboard,Parents/Children_Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


[Titanic Data Dictionary](https://github.com/LambdaSchool/data-science-practice-datasets/tree/main/unit_1/Titanic)




- Which variable is the identifier variable?
- Which variables are categorical?  Are they ordinal or nominal?
- Which variables are quantitative?  Are they continuous or discrete?


In [None]:
# Identifier variable: Name
# Pclass = Categorical, Ordinal
# Survived = Categorical, Nominal (stored as 0/1)
# Sex = Categorical, Nominal
# Age = Quantitative, Continuous (fractional values such as 0.42 appear)
# Siblings/Spouses_Aboard = Quantitative, Discrete
# Parents/Children_Aboard = Quantitative, Discrete
# Fare = Quantitative, Continuous