# Lesson 1 CSCI 3022


Practice with Exploratory Data Analysis and Intro to Pandas to accompany Lesson 1


# Intro to Exploratory Data Analysis

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Key Considerations in Exploratory Data Analysis:

- **1).   Structure -- what is the “shape” of a data file?**

- 2). Granularity -- what type of data does each record represent? how fine/coarse is each record in your data?

- 3). Scope -- does the data cover the target population?

- 4). Temporality -- how is the data situated in time?

- 5).  Faithfulness -- how well does the data capture “reality”?


## Dataset: United States Presidential Election Data
For our first analysis, we will look at some data from US presidential elections. This data is already stored in `data/elections.csv`.

***
## EDA - 1).  What is the Structure of the Data?

We refer to a dataset’s **structure** as a mental representation of the data, and in particular, we represent data that have a tabular structure by arranging values in rows and columns. 

**Guiding Questions When Examining Data Structure**
     
   - What is the size of the data?
   - What type of file is it? (Do we trust this file extension?)
   - Are the data organized in records or nested?
   - Can we define records by parsing the data?
   - Can we reasonably un-nest the data?
   - Does the data reference other data?
   - Can we join/merge the data? (Do we need to?)
   - What are the fields in each record?
     - How are they encoded? (e.g., strings, numbers, binary, dates …)
     - Datatype/Storage type: How each variable value is stored in memory.
        - integer, floating point, boolean, object (string-like), etc.
        - Affects which pandas functions you use.
     - Variable type/Feature type for modeling and visualization:
        - Conceptualized measurement of information (and therefore what values it can take on).
            - Use expert knowledge; Explore data itself; Consult data codebook (if it exists).

     


### How big is the data?
I often like to start my analysis by getting a rough estimate of the size of the data. This will help inform the tools I use and how I view the data. If it is relatively small, I might use a text editor or a spreadsheet to look at the data. If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.

However, here we will use Python tools to probe the file:


In [None]:
# The following line imports a module called os (https://docs.python.org/3/library/os.html)
import os

# We use functions in the os module to get the size of the data and print the result:
print("data/elections.csv is", os.path.getsize("data/elections.csv") / 1e6, "MB")

I might also want to investigate the number of lines, which often corresponds to the number of records:

In [None]:
with open("data/elections.csv", "r") as f:
    # Count lines by iterating over the file one line at a time
    print("data/elections.csv", "is", sum(1 for _ in f), "lines.")

We can see this is not a huge file.

<br>

---

# Exploring CSV Files


We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
1. Using the JupyterLab explorer tool to look at the data
2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
3. The Python file object
4. pandas, using `pd.read_csv()`

<br>


---

## Play with the data in the Jupyter Lab Explorer
1, 2. Let's start with the first two so we really solidify the idea of a CSV as **rectangular data (i.e., tabular data) stored as comma-separated values**.

<br>

---

## Play with the data in Python

3. Next, let's try using the Python file object. Let's check out the first four lines:

In [None]:
with open("data/elections.csv", "r") as f:
    for i, row in enumerate(f):
        print(row)
        if i >= 3: break

Whoa, why are there blank lines interspersed between the lines of the CSV?

Each line read from a text file ends with the special newline character `\n`. Python's `print()` outputs each string (including that newline) and then adds another newline of its own.

If you're curious, we can use the `repr()` function to show each raw string with its special characters written out:

In [None]:
with open("data/elections.csv", "r") as f:
    for i, row in enumerate(f):
        print(repr(row)) # print raw strings
        if i >= 3: break
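
To avoid the doubled line breaks altogether, one option is to tell `print()` not to add its own newline, or to strip the newline from each row first. The sketch below uses a small in-memory sample (hypothetical values, not taken from `data/elections.csv`) so that it runs on its own:

```python
import io

# Hypothetical two-line sample standing in for the start of a CSV file.
sample = io.StringIO("Year,Candidate\n2000,Candidate A\n")

cleaned = []
for row in sample:
    # Each row still ends with "\n"; end="" stops print() from adding a second one.
    print(row, end="")
    # Alternatively, strip the trailing newline before storing or printing the text.
    cleaned.append(row.rstrip("\n"))

print(cleaned)
```

Either approach leaves the file untouched; only the way the rows are displayed changes.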

As data gets bigger it will be important to read only the parts you need into the notebook.
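
As a rough sketch of reading only part of a file, `itertools.islice` can stop after a fixed number of lines without loading the rest. The in-memory sample and its values are hypothetical stand-ins for a much larger file:

```python
import io
from itertools import islice

# Hypothetical sample; in practice this would be an open file handle.
sample = io.StringIO(
    "Year,Candidate\n"
    "2000,Candidate A\n"
    "2004,Candidate B\n"
    "2008,Candidate C\n"
)

# Read only the first two lines; the remaining lines are never consumed.
first_rows = [line.rstrip("\n") for line in islice(sample, 2)]
print(first_rows)
```

`pd.read_csv()` offers a similar option through its `nrows` argument.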

<br/>

---

4. Finally, let's see the tried-and-true CSCI 3022 approach: **pandas**.

## Pandas module:
***
**Pandas** is an open-source $\color{red}{\text{data analysis module}}$ in Python used for storing, cleaning, wrangling, and analyzing data. (Fun fact: Its name is derived from the term "$\textbf{pan}$el $\textbf{da}$ta", a common term for multidimensional data sets encountered in statistics and econometrics.)





First, let's import the Pandas module. It's customary in data science to import Pandas with the alias $\texttt{pd}$. We can then access any function in the Pandas library by prefixing its name with $\texttt{pd.}$

In [None]:
import pandas as pd

### $\color{red}{\textbf{Pandas}}$ Data Structures




Pandas has three types of data structures: 
- **Series**: A one-dimensional array with labeled indices (can hold mixed data types).
-  **DataFrame**: 2D tabular data structure with both row and column labels.  $\color{red}{\text{Rows}}$ have a specific index to access them, which can be $\color{red}{\text{any name or value}}$. The $\color{blue}{\text{columns}}$ are just $\color{blue}{\text{Pandas Series}}$. The Pandas DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. 
-  **Index**:  A sequence of row/column labels


![pandas-DataStructure.jpg](attachment:pandas-DataStructure.jpg)
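
To make the three structures concrete, here is a minimal sketch built from made-up values (the labels `a`, `b`, `c` and the column names are purely illustrative):

```python
import pandas as pd

# A Series: one-dimensional values with labeled indices (hypothetical data).
votes = pd.Series([120, 95, 40], index=["a", "b", "c"], name="votes")

# A DataFrame: two-dimensional, with both row labels and column labels.
df = pd.DataFrame(
    {"votes": [120, 95, 40], "won": [True, False, False]},
    index=["a", "b", "c"],
)

print(type(df["votes"]))  # selecting one column gives back a Series
print(df.index)           # the row labels are stored in an Index
print(df.columns)         # the column labels are stored in an Index as well
```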

### Loading Data Into a DataFrame:

Pandas' [read_csv function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is one of the most versatile and useful functions for managing data.

Since we're loading a CSV file whose data is already in tabular format, with each row representing a record of election data for a specific party in a given year, we don't need to pass any additional arguments to the function for this file:

In [None]:
elections = pd.read_csv("data/elections.csv")
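
To see what `pd.read_csv()` does with well-formed tabular text, the sketch below parses a tiny in-memory CSV instead of a file on disk. The rows are hypothetical and only mimic the layout of a header line followed by records:

```python
import io
import pandas as pd

# Hypothetical CSV text: a header row followed by two records.
csv_text = (
    "Year,Candidate,Popular vote\n"
    "2000,Candidate A,100\n"
    "2000,Candidate B,90\n"
)

# read_csv accepts a file path or any file-like object; no extra arguments needed here.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels taken from the header row
```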


### Viewing Data in DataFrames:

Two useful methods for viewing dataframes are:

[`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)


`.tail()`


In [None]:
# By default, the .head() method shows the first 5 rows.
# If you want to see the first n rows, call .head(n)

elections.head()


**Practice:  Select the last 8 rows of the DataFrame:**

In [None]:
# Select the last 8 rows of the DataFrame:
...

## Determine what each variable in your dataset represents:

Ideally, columns in the dataset are named in a way that clearly explains what they represent. If not, you will want to refer to the data's codebook (if one exists).

For this particular dataset the columns represent the following:



|Column|Description|
| --- | --- |
|Year| Year of the election | 
|Candidate| Candidate who ran| 
|Party | Party of candidate|
|Popular vote | Number of popular votes candidate received |
|Result | Whether the candidate won or lost the election |
|% | The percentage of popular votes the candidate received|



**Practice: List 2 different questions you could try to answer using this dataset:**

### Datatype/Storage Types
It's important to check if the variable type corresponds to how you would interpret the data.  Sometimes quantitative data is loaded as a string (and needs to be converted) or sometimes data that appears quantitative (1, 2, 3) is actually a code to represent a qualitative feature.  We will dive more deeply into how we conceptualize variable types when we discuss visualizing data in the next lesson.
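
Both situations can be sketched with small made-up Series. The values are hypothetical, and `pd.to_numeric` and `astype("category")` are just one reasonable way to handle each case:

```python
import pandas as pd

# Case 1: quantitative data stored as text (hypothetical counts with thousands separators).
raw = pd.Series(["1,200", "950", "3,400"])
print(raw.dtype)  # a string-like dtype, not a numeric one

# Remove the separators, then convert the text to numbers.
converted = pd.to_numeric(raw.str.replace(",", "", regex=False))
print(converted.dtype)

# Case 2: numbers that are really codes for a qualitative feature.
codes = pd.Series([1, 2, 3, 1])
labels = codes.astype("category")  # treat the codes as categories, not quantities
print(labels.dtype)
```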

A quick way to view the datatypes of all your columns is the `.info()` method, which outputs the column integer positions, column labels, data types, memory usage, and the number of non-null cells in each column.


In [None]:
# Practice:  Call the .info() method on the elections dataframe
elections.info()

You can also check the datatype of a single column with
`df[colname].dtype`

In pandas, the `object` datatype usually indicates string or mixed data.
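
As a small illustration, the sketch below checks datatypes on a toy DataFrame with hypothetical values. The `.dtypes` attribute, which lists the datatype of every column at once, is shown alongside the single-column form:

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "Year": [2000, 2004],
        "Candidate": ["Candidate A", "Candidate B"],
        "%": [48.5, 51.5],
    }
)

print(toy["Year"].dtype)  # datatype of one column
print(toy.dtypes)         # datatypes of all columns
```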


### Conceptual Variable/Feature Types for modeling:
 - We'll discuss this more in a future lesson.  This is important for determining how to visualize the data and how/if to use it for modeling.  
   