
# Practice Loading and Exploring Datasets

This assignment is purposely open-ended. You will be asked to load datasets from the [UC-Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). 

Even though you may be using different datasets than your classmates, try to be supportive and assist each other with the challenges you are facing. You will only deepen your understanding of these topics as you work to assist one another. Many popular UCI datasets face similar data loading challenges.

Remember that the UCI datasets do not necessarily have a file type of `.csv`, so it's important that you learn as much as you can about the dataset before you try to load it. See if you can look at the raw text of the file, either locally, using the `!curl` shell command, or in some other way, before you try to read it in as a dataframe. This will help you catch what would otherwise be unforeseen problems.
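
One way to look at the raw text without a shell command is to read only the first few lines in Python. The following is a minimal sketch rather than a required approach: `peek` is a hypothetical helper, and the `io.StringIO` object is a stand-in for a real opened file (the sample rows are a small illustrative excerpt in the style of the Iris data).

```python
import io
from itertools import islice


def peek(handle, n=5):
    """Return the first n lines of a text file-like object, without newlines."""
    return [line.rstrip("\n") for line in islice(handle, n)]


# Stand-in for an opened data file; a real one would come from open("some_file.data").
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

first_lines = peek(raw, n=2)
for line in first_lines:
    print(line)

# Count comma-separated fields on the first line to anticipate the column count.
n_fields = len(first_lines[0].split(","))
print("fields per row:", n_fields)
```

If the first printed line looks like data rather than column names, the file most likely has no header row.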

Feel free to embellish this notebook with additional markdown cells, code cells, comments, graphs, etc. Add whatever you think helps to adequately address the questions.

## 1) Load a dataset from UCI (via its URL)

Please navigate to the home page and choose a dataset (other than the Adult dataset) from the "Most Popular" section on the right-hand side of the home page. Load the dataset via its URL and check the following (show your work):

- Are the headers showing up properly?
- Look at the first 5 and the last 5 rows. Do they seem to be in order?
- Does the dataset have the correct number of rows and columns as described in the UCI page?
 - Remember that UCI does not count the y variable (the column of values that we might want to predict via a machine learning model) as an "attribute" but rather as a "class attribute", so you may end up seeing one more column than the number listed on the UCI website.
- Does UCI list this dataset as having missing values? Check for missing values and see whether your analysis corroborates what UCI reports.
- If `NaN` values or other missing-value indicators are not being detected by `df.isnull().sum()`, find a way to replace whatever is indicating the missing values with `np.nan`.
- Use the `.describe()` function to see the summary statistics of both the numeric and non-numeric columns.
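
The missing-value bullet above can be sketched as follows. This is an illustration under stated assumptions, not the only way to do it: the three-row sample and the `?` marker are hypothetical, and `df_demo` is a made-up name.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample; "?" marks a missing value in column "b".
raw = io.StringIO("1.0,2.0\n3.0,?\n5.0,6.0\n")
df_demo = pd.read_csv(raw, header=None, names=["a", "b"])

# "?" is an ordinary string to pandas, so isnull() reports nothing missing.
missing_before = int(df_demo.isnull().sum().sum())

# Swap the marker for NaN, then convert the column back to a numeric dtype.
df_demo = df_demo.replace("?", np.nan)
df_demo["b"] = pd.to_numeric(df_demo["b"])
missing_after = int(df_demo.isnull().sum().sum())

print(missing_before, missing_after)
```

An alternative is to pass `na_values="?"` to `pd.read_csv`, which converts the marker while the file is being read.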

In [0]:
# TODO your work here!
# And note you should write comments, descriptions, and add new
# code and text blocks as needed
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
  

In [0]:
import pandas as pd
import numpy as np

In [0]:
# The file has no header row, so pandas uses the first data row as column names:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data")
df.columns
# The column labels below are data values, which confirms the header is missing.

Index(['63.0', '1.0', '1.0.1', '145.0', '233.0', '1.0.2', '2.0', '150.0',
       '0.0', '2.3', '3.0', '0.0.1', '6.0', '0'],
      dtype='object')

In [0]:
# There are 14 attributes that should be included and 303 instances.
# The following code checks to confirm the data read in matches:
df.shape

(302, 14)

In [0]:
# Note that the number of columns is correct, but one row is missing:
# read_csv consumed the first data row as the header. Passing header=None
# to read_csv would keep that row as data.
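
The off-by-one row count can be reproduced on a tiny example. The sketch below assumes a headerless file and uses a made-up three-column excerpt; `df_default` and `df_fixed` are hypothetical names.

```python
import io

import pandas as pd

# Two-row stand-in for a headerless data file (illustrative values only).
text = "63.0,1.0,145.0\n67.0,1.0,160.0\n"

# Default behaviour: the first line is treated as the header, so one row is lost.
df_default = pd.read_csv(io.StringIO(text))

# header=None keeps every line as data; names= supplies the labels up front.
df_fixed = pd.read_csv(
    io.StringIO(text), header=None, names=["age", "sex", "trestbps"]
)

print(df_default.shape)
print(df_fixed.shape)
```

Supplying `names=` at read time also removes the need to overwrite `df.columns` afterwards.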

In [0]:
# Rename column headers to match attribute information given from source:
df.columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 
              'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

In [0]:
# Confirm the column name change worked:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

In [0]:
# Check dataset order:
print(df.head())
print(df.tail())
# The rows appear to be in order.

In [0]:
df.isnull().sum()
# No missing data is detected; however, UCI reports that this dataset has missing values.
# This file marks them with "?", which pandas reads as an ordinary string rather than NaN.
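
When `isnull()` finds nothing but missing values are expected, one option is to look for entries that fail to parse as numbers. This is a rough sketch under assumptions: the sample data is hypothetical, and it presumes the affected columns were meant to be numeric.

```python
import io

import pandas as pd

# Hypothetical sample in which "?" stands for a missing value.
raw = io.StringIO("0.0,3.0\n?,7.0\n2.0,?\n")
df_demo = pd.read_csv(raw, header=None, names=["ca", "thal"])

# Columns that should be numeric but contain stray markers are read as text.
suspects = {}
for col in df_demo.select_dtypes(exclude="number").columns:
    # Values that cannot be parsed as numbers become NaN under errors="coerce".
    unparsable = pd.to_numeric(df_demo[col], errors="coerce").isnull()
    suspects[col] = sorted(df_demo.loc[unparsable, col].unique())

print(suspects)
```

Each entry in `suspects` lists the raw strings that blocked numeric parsing, which usually reveals the missing-value marker.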

In [0]:
# Shows summary statistics of numeric columns and confirms only numbers are included:
print(df.describe(include=np.number))
df_numbers = df.describe(include=np.number)
print(df_numbers.dtypes)
# Shows summary statistics of non-numeric columns and confirms numbers are excluded:
print(df.describe(exclude=np.number))
df_non_numbers = df.describe(exclude=np.number)
print(df_non_numbers.dtypes)

## 2) Load a dataset from your local machine.
Choose a second dataset from the "Popular Datasets" listing on UCI, but this time download it to your local machine instead of reading it in via the URL. Upload the file to Google Colab using the files tab in the left-hand sidebar or by importing `files` from `google.colab`. The following link is a useful resource if you can't remember the syntax: <https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92>

- Answer all of the same bullet point questions from part 1 again on this new dataset. 
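
In Colab, `files.upload()` returns a dictionary that maps each uploaded filename to its raw bytes, so the data can be parsed straight from memory. The sketch below simulates that dictionary so it runs anywhere; the `uploaded` value and the two sample rows are stand-ins, not real upload output.

```python
import io

import pandas as pd

# Stand-in for the return value of files.upload(): a dict of filename -> bytes.
uploaded = {
    "iris.data": b"5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
}

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# Wrap the raw bytes in a file-like object so read_csv can parse them.
df_iris = pd.read_csv(io.BytesIO(uploaded["iris.data"]), header=None, names=cols)
print(df_iris.shape)
```

Because Colab also saves the uploaded file to the working directory, reading it by filename works as well.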


In [0]:
# Loads dataset from local machine:
from google.colab import files
uploaded = files.upload()

Saving iris.data to iris.data


In [0]:
# Check the raw data. curl expects a URL, so use head for a local file:
!head iris.data


In [0]:
# Read in the uploaded Iris data:
df = pd.read_csv("iris.data")

In [0]:
# Check if the headers are loaded in correctly:
df.columns

Index(['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'], dtype='object')

In [0]:
# Check if rows/column numbers match UCI report:
df.shape

(149, 5)

In [0]:
# The file has no header row, so the first data row was read as column names
# (hence 149 rows instead of 150). Using the UCI page as reference,
# the following code sets the proper column names.
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']


In [0]:
# Confirms the headers have been added in:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')

In [0]:
# Outputs the first five rows and the last five rows:
print(df.head())
print(df.tail())

In [0]:
# UCI does not list any missing values for this dataset. The following code checks whether the read-in data has any:
df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64

In [0]:
# Shows summary statistics of numeric columns and confirms only numbers are included:
df_numeric = df.describe(include=np.number)
print("The numeric-only columns include: \n")
print(df_numeric.dtypes)
# Shows summary statistics of non-numeric columns and confirms only non-numbers are included:
df_non_numeric = df.describe(exclude=np.number)
print("\nThe non-numeric columns include: \n") 
print(df_non_numeric.dtypes)

The numeric-only columns include: 

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
dtype: object

The non-numeric columns include: 

class    object
dtype: object


## 3) Make Crosstabs of the Categorical Variables

Take whichever of the above datasets has more categorical variables and use crosstabs to tabulate the different instances of the categorical variables.
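
As a small illustration of what a crosstab of two categorical variables produces, here is a sketch on made-up data. The column names echo the heart-disease dataset, but the values in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "sex": [1, 1, 0, 0, 1, 0],
    "fbs": [0, 1, 0, 0, 0, 1],
})

# Counts for every combination of the two categories, plus row/column totals.
counts = pd.crosstab(toy["sex"], toy["fbs"], margins=True)

# normalize="index" turns each row into proportions that sum to 1.
shares = pd.crosstab(toy["sex"], toy["fbs"], normalize="index")

print(counts)
print(shares)
```

Crosstabs are most readable when both variables take only a handful of distinct values; continuous columns tend to produce very wide, sparse tables.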


In [0]:
# Loads in the correct dataset. The file has no header row, so header=None
# keeps the first data row and names= supplies the column labels:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                 header=None,
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                        'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'])

In [0]:
# Tabulates the data between age and cholesterol
pd.crosstab(df['age'], df['chol'])

In [0]:
# Tabulates the data between sex (1 = male, 0 = female) and cholesterol (chol)
pd.crosstab(df['sex'], df['chol'])

In [0]:
# Tabulates the data between chest pain type (cp) and resting blood 
# pressure (trestbps)
pd.crosstab(df['cp'], df['trestbps'])

In [0]:
# Tabulates the data between fasting blood sugar (fbs) and resting 
# electrocardiograph results (restecg)
pd.crosstab(df['fbs'], df['restecg'])

In [0]:
# Tabulates the data between maximum heart rate achieved (thalach) and 
# exercise induced angina (exang; 1 = yes, 0 = no)
pd.crosstab(df['thalach'], df['exang'])

In [0]:
# Tabulates the data between ST depression induced by exercise relative 
# to rest (oldpeak) and the slope of the peak exercise ST segment (slope)
pd.crosstab(df['oldpeak'], df['slope'])

In [0]:
# Tabulates the data between number of major vessels (0-3) colored by 
# fluoroscopy (ca) and diagnosis of heart disease (num)
pd.crosstab(df['ca'], df['num'])

## 4) Explore the distributions of the variables of the dataset using:
- Histograms
- Scatterplots
- Density Plots

In [0]:
# Creates histogram plot of the dataset
df.plot.hist();

In [0]:
# Creates a scatterplot to compare age and chol
df.plot.scatter(x='age', y='chol');

In [0]:
# Creates a scatterplot to compare sex and chol
df.plot.scatter(x='sex', y='chol');

In [0]:
# Creates a scatterplot to compare cp and trestbps
df.plot.scatter(x='cp', y='trestbps');

In [0]:
# Creates a scatterplot to compare fbs and restecg
df.plot.scatter(x='fbs', y='restecg');

In [0]:
# Creates a scatterplot to compare thalach and exang
df.plot.scatter(x='thalach', y='exang');

In [0]:
# Creates a scatterplot to compare oldpeak and slope
df.plot.scatter(x='oldpeak', y='slope');

In [0]:
# Creates a density plot of the dataset
df.plot.density();

In [0]:
# Creates a density plot of age with a narrow bandwidth (bw_method=0.1) to reveal finer detail
df['age'].plot.density(bw_method=0.1);

In [0]:
# Creates a density plot of age with the default bandwidth
df['age'].plot.density();

In [0]:
# Creates a density plot of age with a wider bandwidth (bw_method=0.5) for a smoother curve
df['age'].plot.density(bw_method=0.5);

In [0]:
# Creates a density plot of cholesterol with a narrow bandwidth (bw_method=0.1) to reveal finer detail
df['chol'].plot.density(bw_method=0.1);

In [0]:
# Creates a density plot of cholesterol with the default bandwidth
df['chol'].plot.density();

In [0]:
# Creates a density plot of cholesterol with a wider bandwidth (bw_method=0.5) for a smoother curve
df['chol'].plot.density(bw_method=0.5);
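
To put the `bw_method` values above in context: pandas builds density plots with `scipy.stats.gaussian_kde`, whose default bandwidth follows Scott's rule, and a scalar `bw_method` is used as the bandwidth factor directly. The sketch below computes the Scott factor for one-dimensional data; the row count is an assumption taken from the `df.shape` output in section 1.

```python
# Scott's rule factor: n ** (-1 / (d + 4)), with d = 1 for a single column.
n_rows = 302  # row count shown by df.shape in section 1 of this notebook
scott_factor = n_rows ** (-1 / 5)

for bw in (0.1, scott_factor, 0.5):
    label = "default (Scott)" if bw == scott_factor else "manual"
    print(f"bw_method={bw:.3f}  {label}")
```

So 0.1 is narrower than the default (a bumpier curve that follows the data closely) and 0.5 is wider (a smoother curve that hides small features).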

## 5) Create at least one visualization from a crosstab:

Remember that a crosstab is just a dataframe and can be manipulated in the same way: by row index, column index, or column/row/cell position.
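
A short sketch of what that looks like in practice, using made-up data (the `toy` frame, its `age_group` labels, and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "age_group": ["40s", "50s", "50s", "60s", "60s", "60s"],
    "num": [0, 0, 1, 1, 1, 0],
})
ct = pd.crosstab(toy["age_group"], toy["num"])

row_50s = ct.loc["50s"]   # select a row by its index label
col_1 = ct[1]             # select a column by its label
cell = ct.iloc[2, 1]      # select a cell by position (third row, second column)

# Any DataFrame plot method also works on the result, e.g. ct.plot.bar(stacked=True).
# Plotting is left as a comment so that this sketch runs without matplotlib.
print(row_50s, col_1, cell, sep="\n")
```

Selecting a subset of rows or columns before plotting is often the easiest way to keep a crosstab chart readable.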


In [0]:
# Tabulates data between age and diagnosis of heart disease (num) and sets it to a variable
# For num: Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing
df_crosstab_01 = pd.crosstab(df['age'], df['num'])
print(df_crosstab_01)

In [0]:
# Uses a histogram plot to compare crosstab data
df_crosstab_01.plot.hist();

In [0]:
# Uses a density plot with a narrow bandwidth (bw_method=0.1) to reveal finer detail in the crosstab data
df_crosstab_01.plot.density(bw_method=0.1);


In [0]:
# Uses a density plot to compare crosstab data
df_crosstab_01.plot.density();

In [0]:
# Uses a density plot with a wider bandwidth (bw_method=0.5) for a smoother view of the crosstab data
df_crosstab_01.plot.density(bw_method=0.5);

In [0]:
# Uses a line plot to compare crosstab data
df_crosstab_01.plot.line();

## Stretch Goals 

The following additional study tasks are optional; they are intended to give you an opportunity to stretch yourself beyond the main requirements of the assignment. You can pick and choose from the tasks below, and you do not need to complete them in any particular order.

### - Practice Exploring other Datasets

### -  Try using the Seaborn plotting library's "Pairplot" functionality in order to explore all of the possible histograms and scatterplots of your dataset all at once:

[Seaborn Pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

### - Turn some of the continuous variables into categorical variables by binning the values using:
- [pd.cut()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html)
- [pd.qcut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html)
- <https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut>

And then use crosstabs to compare/visualize these binned variables against the other variables.
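
The difference between the two binning functions can be seen on a small example. This is a sketch on made-up numbers; the `toy` frame and its values are hypothetical, and the choice of four bins is arbitrary.

```python
import pandas as pd

# Hypothetical ages and outcomes; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [29, 35, 41, 48, 52, 57, 63, 77],
    "num": [0, 0, 0, 1, 0, 1, 1, 1],
})

# pd.cut: bins of equal width across the value range (counts may be uneven).
toy["age_cut"] = pd.cut(toy["age"], bins=4)

# pd.qcut: bins holding roughly equal numbers of rows (widths may be uneven).
toy["age_qcut"] = pd.qcut(toy["age"], q=4)

print(toy["age_cut"].value_counts(sort=False))
print(toy["age_qcut"].value_counts(sort=False))

# The binned column is categorical, so it can go straight into a crosstab.
binned_ct = pd.crosstab(toy["age_qcut"], toy["num"])
print(binned_ct)
```

With only a few bins, the resulting crosstab stays compact enough to read or plot directly.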


### - Other types and sources of data
Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You still will probably want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

Overall, in the course of learning data science you will load data in a variety of ways. Another common way to get data is from a database: most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.
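
Although the paragraph above points to command-line tools, the same idea can be tried in Python as a first step. The sketch below is illustrative only: the log lines and their format (timestamp, level, `user=`, `action=`) are invented, and real logs will need a different pattern.

```python
import re
from collections import Counter

# Hypothetical log lines; the format is an assumption made for this example.
log_text = """\
2020-01-01 10:00:01 INFO user=alice action=login
2020-01-01 10:00:05 ERROR user=bob action=upload
2020-01-01 10:01:12 INFO user=alice action=logout
"""

pattern = re.compile(r"^(\S+ \S+) (\w+) user=(\w+) action=(\w+)$")

records = []
for line in log_text.splitlines():
    match = pattern.match(line)
    if match:  # skip lines that do not fit the assumed format
        timestamp, level, user, action = match.groups()
        records.append({"timestamp": timestamp, "level": level,
                        "user": user, "action": action})

# Tally how many records were logged at each level.
level_counts = Counter(record["level"] for record in records)
print(level_counts)
```

A list of dictionaries like `records` can be passed to `pd.DataFrame` to continue the analysis in tabular form.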

One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface. While the term originally referred to, for example, the way an application interfaced with the GUI or other aspects of an operating system, it now largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" to which you have (usually limited) access.

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. Image, text, or (public) APIs are probably more tractable - databases are interesting, but there aren't many publicly accessible and they require a great deal of setup.

In [0]:
import seaborn as sns

In [0]:
# Creates pairplot that compares the read-in dataset 
sns.pairplot(df);

In [0]:
# Cuts data for age into 22 bins
df_age_cut = pd.cut(df['age'], 22)

In [0]:
# Cuts data for cholesterol into 22 bins
df_chol_cut = pd.cut(df['chol'], 22)

In [0]:
# Compares counts of the df_age_cut bins to the age data from the dataset to 
# determine the reasonableness of the bin data
print(df_age_cut.value_counts())
print(df['age'].value_counts())
