Lambda School Data Science

*Unit 1, Sprint 1, Module 1*

---



# Exploratory Data Analysis

- Student can submit assignments via GitHub (save work to GitHub).
- Student can load a dataset (CSV) from a URL using `pandas.read_csv()`
- Student can load a dataset (CSV) from a local file using `pandas.read_csv()`
- Student can use basic pandas EDA functions like: `df.describe()`, `df.isnull()`, `df['column'].value_counts()`, `pd.crosstab()`.
- Student can generate basic visualizations with Pandas: line plot, histogram, scatterplot, density plot.

# [Objective](#save-to-github) - Save a .ipynb file (Colab Notebook) to GitHub



## Overview

GitHub is a website where you can save code or other files either for personal use or for sharing with others. The website is used primarily for storing "open-source" project files so that users can work together on large code bases without overwriting each other's work. You will be using GitHub to collaborate on large projects, both with other students and in your career. 

In order to help you get familiar with this tool we have structured our assignment submission process around the typical GitHub workflow to try and mimic how this tool is used. The following process is the workflow that you will follow in order to submit your assignments so that the Team Leads can view your work and give you daily feedback.

## Follow Along

### 1) Fork the Repository for that Sprint at the beginning of the Sprint

**NOTE: You will only do this step a single time at the beginning of each sprint.**

Go to <http://github.com/lambdaschool>

All of our data science curriculum can be accessed through this page.

In the search bar start typing:

`DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling`

Repositories that don't match what you are typing in the search bar will be filtered out, eventually leaving this sprint's repository. 

> "Repository" is a fancy word that just means: folder where we are going to store some files on GitHub. You'll hear people say "repo" for short.

**At the beginning of each sprint you will need to find that sprint's corresponding repository and "fork" it to your personal GitHub account.** "Forking" a repository is GitHub lingo for "Make a copy." If you click the fork button on the top right corner of the webpage, GitHub will make a copy of the folder of files that we will be using for that sprint to your personal GitHub account. You will be doing your work and saving your changes to the copied version on your account.

You can tell when you have successfully forked a repository because you should briefly see an animation appear that looks like a book is being photocopied with a fork stuck in it and then you will be redirected to your copy of the repository.

You can always tell when you're looking at the forked version on your personal GitHub account by looking at the name of the repository and at the username just to the left of it in the filepath:

![Forked Repository Username Screenshot](https://lambdachops.com/img/fork-repository-screenshot.png)


### 2) Open one of the files and make a change to it. 

The files that we will be working with primarily during the course have the file extension .ipynb, short for "IPython Notebook". Any of these are notebooks that we can open in Google Colab.

To open one of these notebook files in Google Colab go to:

<https://colab.research.google.com/github/> 

If you haven't done so already, give Google permission to access your GitHub account from your Google Account.

Once you have all of the permissions sorted out, select the repository that you're most interested in from the dropdown menu. Once you select a repository Google Colab will look through it to find all of the .ipynb files and will list them below:

![Open .ipynb file from GitHub in Google Colab](https://lambdachops.com/img/google-colab-github.png)

If you don't like going to this link every day to open your notebooks, there is also a Google Chrome extension that you can use to easily open any .ipynb file from GitHub directly in Google Colab: 

### [Google Chrome Extension to Open .ipynb files easily in Google Colab](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo?hl=en)

### 3) Save your changes back to your forked repository on GitHub.

Once you have finished making all of the changes that you want to the notebook, you can save your work back to GitHub by selecting `File` >> `Save a copy in GitHub` from the dropdown menu. 

When you select this a new tab will open in your browser to show you the saved file on GitHub to let you know that the save has been completed successfully.

![Save A Copy In Github](https://lambdachops.com/img/save-a-copy-in-github.png)

### You will do steps 2 and 3 of this process every day as you work on your assignments; you will only do steps 1 and 4 once, at the beginning of each sprint.

### 4) Submit a "Pull Request" of your work.

The final step in submitting your work is to open a "Pull Request". GitHub won't allow you to complete this step until you have saved some changes to your version of the repository on GitHub.

Opening a Pull Request is something that only needs to be done once per week (typically at the beginning of the week). This pull request is what ties your work back to the original Lambda School repository and makes it easy for the Team Leads to find your work.

In order to open a pull request, navigate to your repository on GitHub and select the "Pull Requests" tab at the top of the page.

![Pull Requests Tab](https://lambdachops.com/img/pull-request.png)

To open a new pull request you will need to click the green "New Pull Request" button and give your pull request a title. Please include your name and cohort number (e.g., DS8, DS9, or DS10) at the beginning of the pull request title so that the Team Leads can easily identify your Pull Request. Once you have filled out the title, just click the remaining large green buttons until the pull request has been submitted.


### In Summary

1) Fork the Repository (make a copy to your personal account)

2) Open the Repository in Google Colab and make changes to the files (work on your assignment).

3) Save the changes back to GitHub using the dropdown menu.

4) Submit a Pull Request sometime before the end of the first day of the sprint so that the TLs can find your work.

## Challenge

You'll have to follow this process, or one very close to it, every day/week for the next nine months. If this feels a little bit overwhelming at first, don't worry about it! We will be doing this every day and you have your Team Leads and classmates to lean on for help. You'll be a pro at using GitHub in no time.

If you're already familiar with GitHub and/or Git via the command line, feel free to use the tools that you are most comfortable with, but you still need to save your work to GitHub every day.

# [Objective](#load-csv-from-url) - Load a dataset (CSV) via its URL

## Overview

In order to practice Loading Datasets into Google Colab, we're going to use the [Flags Dataset](https://archive.ics.uci.edu/ml/datasets/Flags) from UCI to show both loading the dataset via its URL and from a local file.

Steps for loading a dataset:

1) Learn as much as you can about the dataset:
 - Number of rows
 - Number of columns
 - Column headers (Is there a "data dictionary"?)
 - Is there missing data?
 - **OPEN THE RAW FILE AND LOOK AT IT. IT MAY NOT BE FORMATTED IN THE WAY THAT YOU EXPECT.**

2) Try loading the dataset using `pandas.read_csv()` and if things aren't acting the way that you expect, investigate until you can get it loading correctly.

3) Keep in mind that functions like `pandas.read_csv()` have a lot of optional parameters that might help us change the way that data is read in. If you get stuck, google, read the documentation, and try things out.

4) You might need to type out column headers by hand if they are not provided in a neat format in the original dataset. It can be a drag.
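
As a rough sketch of step 1, here is one way to peek at raw text before handing it to pandas. The three rows are invented stand-ins for a downloaded file (not the real flags data), and the checks are only heuristics, so treat this as a starting point rather than a recipe.

```python
import csv
import io

# Stand-in for a downloaded file: three comma-separated rows with no header line.
# The values are invented for illustration only.
raw_text = (
    "Atlantis,1,250,blue\n"
    "Borduria,3,48,red\n"
    "Freedonia,2,90,green\n"
)

# Look at the first few raw lines before parsing anything.
for line in raw_text.splitlines()[:2]:
    print(line)

# Count the fields on each row; a single consistent count suggests
# that the delimiter guess (a comma here) is right.
rows = list(csv.reader(io.StringIO(raw_text)))
field_counts = {len(row) for row in rows}
print(len(rows), "rows; fields per row:", field_counts)

# Column names are rarely pure numbers, so digits in the first row hint
# that the file starts with data rather than a header.
first_row_has_numbers = any(cell.isdigit() for cell in rows[0])
print("first row contains numbers:", first_row_has_numbers)
```

If the field counts disagree from row to row, that is usually the cue to go back to the raw file and look for a different delimiter or stray lines.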

## Follow Along

### Learn about the dataset and look at the raw file.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [0]:
# Find the actual file to download
# From navigating the page, clicking "Data Folder"
# Right click on the link to the dataset and say "Copy Link Address"

flag_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'

# You can "shell out" in a notebook for more powerful tools
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

# Funny extension, but on inspection looks like a csv
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data

# Extensions are just a norm! You have to inspect to be sure what something is

In [0]:
names=['name','landmass','zone','area','population','language', 'religion', 'bars',
       'stripes', 'colours','red', 'green','blue','gold','white','black','orange',
       'mainhue','circles','crosses','saltires','quarters','sunstars','crescent',
       'triangle','icon','animate','text','topleft','botright']
#list of columns taken from the datasheet docs


### Attempt to load it via its URL

In [0]:
# Load the flags dataset from its URL:
df=pd.read_csv(flag_data_url,names=names)

### If things go wrong, investigate and try to figure out why.
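
One thing that commonly goes wrong with a headerless file is that pandas quietly uses the first data row as the column names. The sketch below shows the symptom on a tiny invented file; the rows and the `column_names` list are hypothetical and only stand in for something like `flag.data`.

```python
import io
import pandas as pd

# Invented rows standing in for a headerless file such as flag.data.
raw_text = "Atlantis,1,250\nBorduria,3,48\nFreedonia,2,90\n"
column_names = ["name", "zone", "area"]  # hypothetical names for this toy file

# Default behaviour: the first line is treated as the header row.
wrong = pd.read_csv(io.StringIO(raw_text))

# Passing names= tells pandas the file has no header line of its own.
right = pd.read_csv(io.StringIO(raw_text), names=column_names)

print(wrong.shape)          # (2, 3): one row short
print(list(wrong.columns))  # the first data row has become the column labels
print(right.shape)          # (3, 3)
```

A row count that is one lower than the documentation promises is a good hint that this is what happened.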


In [10]:
# Count missing values per column (isna() and isnull() are aliases, so both counts match):
print(f"na values:\n{df.isna().sum()}\n\nnull values:\n{df.isnull().sum()}")

na values:
name          0
landmass      0
zone          0
area          0
population    0
language      0
religion      0
bars          0
stripes       0
colours       0
red           0
green         0
blue          0
gold          0
white         0
black         0
orange        0
mainhue       0
circles       0
crosses       0
saltires      0
quarters      0
sunstars      0
crescent      0
triangle      0
icon          0
animate       0
text          0
topleft       0
botright      0
dtype: int64

null values:
name          0
landmass      0
zone          0
area          0
population    0
language      0
religion      0
bars          0
stripes       0
colours       0
red           0
green         0
blue          0
gold          0
white         0
black         0
orange        0
mainhue       0
circles       0
crosses       0
saltires      0
quarters      0
sunstars      0
crescent      0
triangle      0
icon          0
animate       0
text          0
topleft       0
botright      0
dtype: int64
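
The two halves of the output above are identical because `isnull()` is just another name for `isna()`. A small sketch on an invented frame (the values are made up and include two deliberately missing cells) makes that easier to see:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing cells; the numbers are made up.
toy = pd.DataFrame({
    "area": [250.0, np.nan, 90.0],
    "mainhue": ["blue", None, "green"],
})

# isnull() is an alias of isna(), so the two masks agree cell for cell.
same = toy.isna().equals(toy.isnull())

# Summing the boolean mask counts the missing cells in each column.
missing_per_column = toy.isna().sum()

print(same)
print(missing_per_column)
```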

### Try Again

In [0]:
# looks pretty clean to me

## Challenge

You'll get very good at reading documentation, Googling, asking for help, troubleshooting, debugging, etc. by the time you're done here at Lambda School. Our goal is to turn you into a data scientist that can solve their own problems. 

# [Objective](#load-csv-from-file) - Load a dataset (CSV) from a local file

## Overview

We won't always have CSVs hosted on the interwebs for us. We need to be able to upload files from our local machines as well. With Google Colab this is trickier than it is with other software (like Jupyter Notebooks, for example). Because a Colab notebook runs on a remote machine rather than on our own computers, we can't use a filepath to the file on our computers in order to access our data. We have to upload our files to Google Colab before we can start working with them.

## Follow Along

### Method 1: Google Colab File Upload Package
- What should we google to try and figure this out?

In [13]:
from google.colab import files
import io
uploaded=files.upload()


Saving flag (1).data to flag (1).data


In [0]:
# The file has no header row, so pass the column names just like in the URL example
df=pd.read_csv(io.BytesIO(uploaded['flag (1).data']),names=names)
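
`files.upload()` only exists inside Colab, so the sketch below fakes its return value with a plain dictionary that maps the uploaded file name to the file's bytes. The file contents and `column_names` are invented for illustration. The point is that you can read the file name out of the dictionary instead of typing it by hand.

```python
import io
import pandas as pd

# Stand-in for the dictionary returned by files.upload(): it maps each
# uploaded file name to that file's contents as bytes. Contents are invented.
uploaded = {"flag (1).data": b"Atlantis,1,250\nBorduria,3,48\n"}
column_names = ["name", "zone", "area"]  # hypothetical names for this toy file

# Take the name from the dictionary rather than typing it, since the saved
# name is not always the one you expect (note the " (1)" in the output above).
filename = next(iter(uploaded))

# BytesIO wraps the raw bytes in a file-like object that read_csv accepts.
df_uploaded = pd.read_csv(io.BytesIO(uploaded[filename]), names=column_names)
print(filename, df_uploaded.shape)
```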

### Method 2: Use the GUI (Graphical User Interface)

In [0]:
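# Nothing to run for this method: open the Files pane (folder icon) in Colab's left sidebar,
# upload the file there, then pass its path to pd.read_csv().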
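
Once a file has been uploaded through the GUI it lives on the notebook's own file system, so an ordinary path should work. The sketch below imitates that outside Colab: a temporary folder plays the role of the notebook's working directory, and the file name, contents, and `column_names` are all invented.

```python
import os
import tempfile
import pandas as pd

# A temporary folder stands in for the notebook's working directory, and the
# file written here stands in for one uploaded through the Files sidebar.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "flag.data")
with open(path, "w") as f:
    f.write("Atlantis,1,250\nBorduria,3,48\n")

# Once the file exists on the notebook's file system, a plain path is enough.
column_names = ["name", "zone", "area"]  # hypothetical names for this toy file
df_local = pd.read_csv(path, names=column_names)
print(os.listdir(workdir), df_local.shape)
```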

## Challenge

On the assignment this afternoon you'll get to choose a new dataset and try both of these methods. We will load hundreds of datasets into notebooks by the time the class is over, so you'll be a pro at it in no time.

# [Objective](#basic-pandas-functions) - Use basic Pandas functions for Exploratory Data Analysis (EDA)

## Overview

> Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations

Exploratory Data Analysis is often the first thing that we'll do when starting out with a new dataset. How we treat our data, the models we choose, the approach we take to analyzing our data and in large part the entirety of our data science methodology and next steps are driven by the discoveries that we make during this stage of the process. 

## Follow Along

What can we discover about this dataset?

- df.shape
- df.head()
- df.dtypes
- df.describe()
 - Numeric
 - Non-Numeric
- df['column'].value_counts()
- df.isnull().sum()
- df.fillna()
- df.dropna()
- df.drop()
- pd.crosstab()
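
Here is a quick tour of the functions in the list above, run on a four-row frame that is invented for illustration (the column names echo the flags dataset, but the values are made up). It is a sketch of typical usage, not the only way to call each function.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; two cells are deliberately missing.
toy = pd.DataFrame({
    "name": ["Atlantis", "Borduria", "Freedonia", "Genovia"],
    "landmass": [1, 3, 3, 1],
    "area": [250.0, 48.0, np.nan, 90.0],
    "mainhue": ["blue", "red", "red", None],
})

print(toy.shape)                          # (rows, columns)
print(toy.head(2))                        # first two rows
print(toy.dtypes)                         # type of each column
print(toy.describe())                     # summary of the numeric columns
print(toy.describe(exclude=[np.number]))  # summary of the non-numeric columns
print(toy["mainhue"].value_counts())      # frequency of each value, NaN skipped
print(toy.isnull().sum())                 # missing cells per column

filled = toy.fillna({"area": toy["area"].mean()})  # fill NaN in one column
complete = toy.dropna()                            # keep rows with no NaN at all
no_name = toy.drop(columns=["name"])               # remove a column

# Cross-tabulate two columns: how often each combination occurs.
table = pd.crosstab(toy["landmass"], toy["mainhue"])
print(table)
```

Note that `fillna()`, `dropna()`, and `drop()` return new DataFrames here, so `toy` itself is left unchanged.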

In [0]:
# Let's try reading in a new dataset: The Adult Dataset
# https://archive.ics.uci.edu/ml/datasets/adult
test_url='https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
train_url='https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
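
Files like these often need a few extra `read_csv()` parameters. The sketch below uses invented rows laid out the way the Adult files appear to be (no header, a space after each comma, and `?` for unknown values). That layout is an assumption, so inspect the raw text of both files before relying on it. The four column names are only a subset chosen for illustration.

```python
import io
import pandas as pd

# Invented rows in the assumed layout: no header line, a space after every
# comma, and "?" standing in for unknown values.
raw_text = (
    "39, State-gov, Bachelors, <=50K\n"
    "50, ?, HS-grad, >50K\n"
    "38, Private, ?, <=50K\n"
)
column_names = ["age", "workclass", "education", "income"]  # subset, for illustration

adult_like = pd.read_csv(
    io.StringIO(raw_text),
    names=column_names,     # the file carries no header row
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat "?" as a missing value
)

print(adult_like.isnull().sum())
print(adult_like["workclass"].value_counts(dropna=False))
```

Without `na_values`, the `?` entries would load as ordinary strings and `isnull().sum()` would report zero missing values.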


## Challenge

Hopefully a lot of the above functions are review for you from the precourse material, but if not, again, don't worry. We'll be using these again on the assignment and most days of class, whenever we need to wrap our heads around a new dataset.

# [Objective](#pandas-visualizations) - Generate Basic Visualizations (graphs) with Pandas

## Overview

One of the cornerstones of Exploratory Data Analysis (EDA) is visualizing our data in order to understand their distributions and how they're interrelated. Our brains are amazing pattern detection machines and sometimes the "eyeball test" is the most efficient one. In this section we'll look at some of the most basic kinds of "exploratory visualizations" to help us better understand our data.

## Follow Along

Let's demonstrate creating a:

- Line Plot
- Histogram
- Scatter Plot
- Density Plot
- Making plots of our crosstabs

How does each of these plots show us something different about the data? 

Why might it be important for us to be able to visualize how our data is distributed?

### Line Plot
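
A minimal sketch of a line plot, using invented daily counts (the column names and values are made up). In a notebook the figure should render on its own when the cell finishes.

```python
import pandas as pd

# Invented measurements: one value per day.
daily = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "visitors": [12, 15, 11, 18, 21],
})

# x and y name the columns to put on each axis. The call returns a
# matplotlib Axes object that can be customised further.
ax = daily.plot.line(x="day", y="visitors", title="Visitors per day (toy data)")
ax.set_ylabel("visitors")
```

Line plots suit data with a natural order along the x axis, such as time.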

### Histogram
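
A minimal sketch of a histogram on a single numeric column. The ages below are invented, and `bins=5` is an arbitrary choice for this toy example.

```python
import pandas as pd

# Invented ages, for illustration only.
ages = pd.Series([22, 25, 25, 31, 34, 38, 41, 45, 52, 67], name="age")

# bins controls how many equal-width intervals the value range is cut into.
ax = ages.plot.hist(bins=5, title="Age distribution (toy data)")
ax.set_xlabel("age")
```

Trying a few different `bins` values is worthwhile, since the apparent shape of a distribution can change with the bin width.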

### Scatter Plot
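
A minimal sketch of a scatter plot, which places one point per row to show how two numeric columns relate. The column names and values are invented.

```python
import pandas as pd

# Invented paired measurements, one row per student.
toy = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "quiz_score": [52, 55, 61, 70, 74, 83],
})

# Each row becomes one point at (hours_studied, quiz_score).
ax = toy.plot.scatter(
    x="hours_studied",
    y="quiz_score",
    title="Score vs. hours studied (toy data)",
)
```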


### Density Plot - Kernel Density Estimate (KDE)
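
A minimal sketch of a density plot, which can be read as a smoothed histogram. The ages are invented, and this call relies on SciPy being installed alongside pandas (which is assumed here; Colab normally ships with it).

```python
import pandas as pd

# Invented ages, for illustration only.
ages = pd.Series([22, 25, 25, 31, 34, 38, 41, 45, 52, 67], name="age")

# plot.density() estimates a smooth curve from the data; plot.kde() is the
# same method under another name. SciPy must be installed for it to run.
ax = ages.plot.density(title="Age density (toy data)")
ax.set_xlabel("age")
```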

### Plotting using Crosstabs
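
A crosstab is itself a DataFrame, so it can be plotted directly. The sketch below builds one from invented rows (the column names echo the flags dataset, but the values are made up) and draws it as a stacked bar chart, which is one reasonable choice among several.

```python
import pandas as pd

# Invented rows: one flag per row, with a landmass code and a main colour.
toy = pd.DataFrame({
    "landmass": [1, 1, 3, 3, 3, 4],
    "mainhue": ["blue", "red", "red", "red", "green", "green"],
})

# Count how many flags fall into each landmass / colour combination.
counts = pd.crosstab(toy["landmass"], toy["mainhue"])

# One bar per landmass, split into coloured segments by mainhue.
ax = counts.plot(kind="bar", stacked=True, title="Main hue by landmass (toy data)")
ax.set_ylabel("number of flags")
```

Dropping `stacked=True` draws the same counts as side-by-side bars, which can be easier to compare when there are only a few categories.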

## Challenge

These are some of the most basic and important types of data visualizations. They're so important that they're built straight into Pandas and can be accessed with some very concise code. At the beginning, our data exploration is about understanding the characteristics of our dataset, but over time it becomes about communicating insights in as effective and digestible a manner as possible, and that typically means using graphs in one way or another. See how intuitive of a graph you can make using a crosstab on this dataset.

# Review

Whew, that was a lot. Again, if this content seems overwhelming, remember that this won't be the last time that we'll talk about the skills contained in this lesson. They're right at the beginning of the course because we'll use these skills nearly every day, so you'll get really good at these things in no time!

You know when you're learning a new board game and somebody tries to explain the rules to you and it doesn't make very much sense? My friends always end up saying something like: "It sounds more complicated than it really is, let's just play a round and you'll get it." 

That's the same message that I have for you. There are a lot of new things here at Lambda School in the first week:

- New Course
- New Schedule
- New Community
- New Tools
- New Processes
- New Content

As we go through a cycle of one sprint, it will all start making a whole lot more sense. 

---

Your assignment for this afternoon can be found in the -other- notebook inside the module folder in this week's repository on GitHub. You are going to pick another [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) dataset and do much of the same as what we have done above.

In order to start out with something that won't be impossibly hard, please use one of the datasets that is listed as "Most Popular" on the right side of the UCI website.

Why am I **not** assigning a specific dataset to you for your assignment? As a baby step in getting you more comfortable with open-endedness. Traditional education has been training you to expect there to be a single correct solution to things, but that's rarely the case in data science. There are pros and cons to every decision that we make. Over the course of the first unit, we will work on helping you be comfortable with open-endedness as we navigate the sea of tradeoffs that exist when we approach data, and you choosing the dataset for your assignment is the first tiny step that we're going to take in that direction. 

Assignment Notebook:

