# Lesson 2 CSCI 3022


Practice with Exploratory Data Analysis and Intro to Pandas to accompany Lesson 2


# Intro to Exploratory Data Analysis

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Key Considerations in Exploratory Data Analysis:

- 1). Structure -- what is the “shape” of a data file?

- 2). Granularity -- what type of data does each record represent? How fine/coarse is each record in your data?

- 3). Scope -- does the data cover the target population?

- 4). Temporality -- how is the data situated in time?

- 5). Faithfulness -- how well does the data capture “reality”?


## Dataset: United States Presidential Election Data
For our first analysis, we will examine data from US presidential elections. This data is already stored in `data/elections.csv`.

***
## EDA - 1).  What is the Structure of the Data?

We refer to a dataset’s **structure** as a mental representation of the data, and in particular, we represent data that have a tabular structure by arranging values in rows and columns. 

**Guiding Questions When Examining Data Structure**
     
   - What is the size of the data?
   - What type of file is it? (Do we trust this file extension?)
   - Are the data organized in records or nested?
   - Can we define records by parsing the data?
   - Can we reasonably un-nest the data?
   - Does the data reference other data?
   - Can we join/merge the data? (Do we need to?)
   - What are the fields in each record?
      - How are they encoded? (e.g., strings, numbers, binary, dates …)
      - Datatype/Storage type: How each variable value is stored in memory.
         - integer, floating point, boolean, object (string-like), etc.
         - Affects which pandas functions you use.
      - Variable type/Feature type of the data for our purposes:
         - Conceptualized measurement of information (and therefore what values it can take on).
            - Use expert knowledge; Explore data itself; Consult data codebook (if it exists).
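
The first two guiding questions (size and file type) can be checked before any parsing. Here is a minimal sketch, assuming a small made-up CSV written to a temporary folder; `sample_text` and `sample.csv` are hypothetical stand-ins, not the course file:

```python
import os
import tempfile

# Hypothetical sample contents standing in for a real data file.
sample_text = (
    "Year,Candidate,Party\n"
    "2020,Joseph Biden,Democratic\n"
    "2020,Donald Trump,Republican\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    # newline="" keeps the line endings exactly as written on every platform.
    with open(sample_path, "w", encoding="utf-8", newline="") as f:
        f.write(sample_text)

    # How big is the file on disk (in bytes)?
    size_bytes = os.path.getsize(sample_path)

    # Peek at the first two raw lines to see whether the contents
    # really look like comma-separated values.
    with open(sample_path, encoding="utf-8") as f:
        first_lines = [next(f).rstrip("\n") for _ in range(2)]

print(size_bytes)
print(first_lines)
```

Looking at the raw lines first is a cheap way to decide whether to trust the `.csv` extension before handing the file to a parser.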

     


## Pandas module:
***
**Pandas** is an open source $\color{red}{\text{data analysis module}}$ in Python used for storing, cleaning, wrangling, and analyzing data.   (Fun fact: It was named as a shortcut for the term "$\textbf{pan}$el  $\textbf{da}$ta", a common term for multidimensional data sets encountered in statistics and econometrics.)





First, let's import the Pandas module.  It's customary in data science to import Pandas with the alias $\texttt{pd}$.  We can then access any function in the Pandas library by prefixing its name with $\texttt{pd.}$  

In [1]:
import pandas as pd

### $\color{red}{\textbf{Pandas}}$ Data Structures




Pandas has three types of data structures: 
- **Series**: A one dimensional array with labeled indices (can be mixed data types). 
-  **DataFrame**: 2D tabular data structure with both row and column labels.  $\color{red}{\text{Rows}}$ have a specific index to access them, which can be $\color{red}{\text{any name or value}}$. The $\color{blue}{\text{columns}}$ are just $\color{blue}{\text{Pandas Series}}$. The Pandas DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. 
-  **Index**:  A sequence of row/column labels


![pandas-DataStructure.jpg](attachment:pandas-DataStructure.jpg)
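
To see how the three structures relate, here is a small sketch on a toy table (the row labels `"a"` and `"b"` are made up for illustration):

```python
import pandas as pd

# Toy DataFrame; the row labels are arbitrary and chosen only for illustration.
df = pd.DataFrame(
    {"Year": [2020, 2020], "Party": ["Democratic", "Republican"]},
    index=["a", "b"],
)

party = df["Party"]               # selecting one column gives a Series

print(type(df).__name__)          # DataFrame
print(type(party).__name__)       # Series
print(type(df.index).__name__)    # Index (row labels)
print(type(df.columns).__name__)  # Index (column labels)

# The Series shares its row labels with the DataFrame it came from.
print(party.index.equals(df.index))  # True
```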

### Loading Data Into a DataFrame:

Pandas' [read_csv function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is one of the most versatile and useful functions for managing data.  

We're loading a CSV file whose data is already in tabular format, with each row representing a record of election data for a specific party in a given year, so we don't need to pass any additional arguments to the function for this file:

In [7]:
elections = pd.read_csv("data/elections.csv")
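
If you want to experiment with `read_csv` without a file on disk, one option is to parse text held in memory. This is only a sketch; `csv_text` is a made-up two-row sample, not the contents of `data/elections.csv`:

```python
import io

import pandas as pd

# Made-up sample text; the first line is treated as the header row by default.
csv_text = """Year,Candidate,Party
2020,Joseph Biden,Democratic
2020,Donald Trump,Republican
"""

# io.StringIO lets read_csv treat a string as if it were an open file.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (2, 3): two records, three fields
print(list(toy.columns))  # ['Year', 'Candidate', 'Party']
```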


### Viewing Data in DataFrames:

Two useful methods for viewing dataframes are:

[`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)


`.tail()`


In [9]:
#Default of the .head() method is to show the first 5 rows. If you want to see n rows, enter .head(n)

elections.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,2024,Kamala Harris,Democratic,75019230,loss,48.34
1,2024,Donald Trump,Republican,77303568,win,49.81
2,2024,Jill Stein,Green,861155,loss,0.6
3,2024,Robert F. Kennedy Jr.,Independent,756383,loss,0.6
4,2024,Chase Oliver,Libertarian,650130,loss,0.4


## Determine what each variable in your dataset represents:

Ideally, columns in the dataset are named in a way that clearly explains what they represent.  If not, you will want to refer to the data's codebook (if one exists).

For this particular dataset the columns represent the following:



|Column|Description|
| --- | --- |
|Year| Year of the election | 
|Candidate| Candidate who ran| 
|Party | Party of candidate|
|Popular vote | Number of popular votes candidate received |
|Result | Whether the candidate won or lost the election |
|% | The percentage of popular votes the candidate received|



### Datatype/Storage Types
It's important to check if the variable type corresponds to how you would interpret the data.  Sometimes quantitative data is loaded as a string (and needs to be converted) or sometimes data that appears quantitative (1, 2, 3) is actually a code to represent a qualitative feature.  We will dive more deeply into how we conceptualize variable types when we discuss visualizing data in the next lesson.
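
Both situations can be sketched on a toy table. The column names `votes` and `region_code` are hypothetical and do not appear in the elections data:

```python
import pandas as pd

raw = pd.DataFrame({
    "votes": ["1,865,724", "405,035"],  # quantitative data loaded as text
    "region_code": [1, 2],              # numbers that are really category codes
})

# Strip the thousands separators, then convert the text to numbers.
raw["votes"] = pd.to_numeric(raw["votes"].str.replace(",", "", regex=False))

# Mark the codes as categorical so they are not treated as quantities.
raw["region_code"] = raw["region_code"].astype("category")

print(raw.dtypes)
```

After the conversion, arithmetic on `votes` behaves as expected, while `region_code` is treated as a set of labels rather than as quantities.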

A quick way to view the datatypes of all your columns is the `.info()` method, which outputs the column integer positions, column labels, data types, memory usage, and the number of non-null cells in each column.


In [4]:
# Practice:  Call the .info() method on the elections dataframe
elections.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          182 non-null    int64  
 1   Candidate     182 non-null    object 
 2   Party         182 non-null    object 
 3   Popular vote  182 non-null    int64  
 4   Result        182 non-null    object 
 5   %             182 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 8.7+ KB


You can also use `df[colname].dtype` to check the datatype of a single column.

The `object` datatype in pandas indicates string or mixed data.  
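
As a quick illustration on a toy table (the `mixed` column is made up to show the mixed-data case):

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2016],
    "mixed": [1, "one"],  # an integer and a string in the same column
})

print(toy["Year"].dtype)   # an integer dtype
print(toy["mixed"].dtype)  # object, because the values are of mixed types
```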


### Variable/Feature Types for our purposes:
 - We'll discuss this more in a future lesson.  This is important for determining how to visualize the data and how/if to use it for modeling.  
   

***
## EDA  - 2).  What is the Granularity of the data?

We use the term granularity to describe the level of measurement that uniquely identifies each record in the table.  

For example, does it represent a measurement from a unique person/event?  An aggregated measurement?   

Data that has a high level of granularity would have a large number of individual pieces of information, such as individual records or measurements. Data that has a low level of granularity would have a small number of individual pieces of information, such as summary data or aggregated data. Data granularity can affect how it is used and analyzed, and can impact the accuracy and usefulness of the results.


**Guiding Questions To Consider:**
 - What is the granularity of the dataset?
 - Do all records capture granularity at the same level?
   - Some data will include summaries (aka rollups) as records
 - If the data has a low level of granularity (i.e. has been aggregated in some way), how were the records aggregated?
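
To make the contrast concrete, here is a sketch that rolls three candidate-level rows (taken from the table preview in this lesson) up to one row per year. The aggregated table is coarser: each record now describes a whole election rather than a single candidate.

```python
import pandas as pd

# Fine granularity: one record per candidate per election year.
fine = pd.DataFrame({
    "Year": [2020, 2020, 2016],
    "Candidate": ["Joseph Biden", "Donald Trump", "Darrell Castle"],
    "Popular vote": [81268924, 74216154, 203091],
})

# Coarse granularity: one record per election year, aggregated by summing.
by_year = fine.groupby("Year")["Popular vote"].sum()

print(by_year)
```

Once the rows are summed, the candidate-level detail cannot be recovered from `by_year` alone, which is why it matters to know how aggregated records were produced.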




In [3]:
elections.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,2020,Joseph Biden,Democratic,81268924,win,51.311515
1,2020,Donald Trump,Republican,74216154,loss,46.858542
2,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979
3,2020,Howard Hawkins,Green,405035,loss,0.255731
4,2016,Darrell Castle,Constitution,203091,loss,0.14964


**Practice:  Based on these 5 rows, what does the granularity of the dataset appear to be?**

To explore this further, we need to select just the columns that uniquely identify each record:

# Extraction:

One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of interest.   

There are three ways we can extract data: `.loc`, `.iloc`, and `[]`.



Index(['Year', 'Candidate', 'Party', 'Popular vote', 'Result', '%'], dtype='object')

### Label-Based Extraction Using `loc`

`loc` selects items by row and column *label*.  

`df.loc[row_labels, column_labels]`

We describe "labels" as the bolded text at the top and left of a DataFrame.




Arguments to `.loc` can be:
1. A row label and column label
2. A list.
3. A slice (syntax is inclusive of the right-hand side of the slice).
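
The three argument styles can be sketched on a toy table built from rows shown earlier in this lesson:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2020, 2016],
    "Candidate": ["Joseph Biden", "Donald Trump", "Darrell Castle"],
    "Party": ["Democratic", "Republican", "Constitution"],
})

# 1. A single row label and a single column label give one value.
single = toy.loc[0, "Party"]

# 2. Lists of labels give exactly the rows and columns listed.
listed = toy.loc[[0, 2], ["Year", "Party"]]

# 3. Slices of labels include BOTH endpoints, unlike ordinary Python slices.
sliced = toy.loc[0:1, "Year":"Candidate"]

print(single)
print(listed)
print(sliced)
```

Note that `0:1` selects two rows here (labels 0 and 1), and `"Year":"Candidate"` selects both of those columns.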

In [6]:
# Here's how we can select all rows and just the Year and Party columns from the elections dataframe.
# Note we use a colon (:) in the first entry because we want to select all rows

elections.loc[:,["Year","Party"]]

Unnamed: 0,Year,Party
0,2020,Democratic
1,2020,Republican
2,2020,Libertarian
3,2020,Green
4,2016,Constitution
...,...,...
177,1832,Anti-Masonic
178,1828,Democratic
179,1828,National Republican
180,1824,Democratic-Republican


### Shortcut: Context-Dependent Extraction

In practice, the `[]` operator is often used to yield more concise code.

`[]` is a bit trickier to understand than `.loc` or `.iloc`, but it achieves essentially the same functionality. The difference is that `[]` is *context-dependent*.

`[]` only takes one argument, which may be:
1. A slice of row integers.
2. A list of column labels.
3. A single column label.
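
A short sketch of how the same operator behaves for each kind of argument, again on a toy table:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2020, 2016],
    "Candidate": ["Joseph Biden", "Donald Trump", "Darrell Castle"],
    "Party": ["Democratic", "Republican", "Constitution"],
})

# 1. A slice of row integers selects rows (right endpoint excluded).
first_two_rows = toy[0:2]

# 2. A list of column labels selects columns and returns a DataFrame.
two_columns = toy[["Year", "Party"]]

# 3. A single column label returns that column as a Series.
one_column = toy["Party"]

print(type(first_two_rows).__name__, first_two_rows.shape)
print(type(two_columns).__name__, two_columns.shape)
print(type(one_column).__name__, one_column.shape)
```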


In [13]:
# Here's a shortcut for selecting all rows and just the columns Year and Party
elections[["Year","Party"]]

Unnamed: 0,Year,Party
0,2020,Democratic
1,2020,Republican
2,2020,Libertarian
3,2020,Green
4,2016,Constitution
...,...,...
177,1832,Anti-Masonic
178,1828,Democratic
179,1828,National Republican
180,1824,Democratic-Republican



#### Useful Utility Function


`.value_counts()`  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html


**Practice: Read the documentation for value_counts() and use it to determine if the combination of Year and Party uniquely identifies each record**

In [16]:
elections.loc[:,["Party","Year"]].value_counts()

Party                  Year
Whig                   1836    2
Democratic-Republican  1824    2
American               1856    1
Republican             1876    1
Prohibition            1920    1
                              ..
Farmer–Labor           1920    1
Free Soil              1848    1
                       1852    1
Green                  1996    1
Whig                   1852    1
Name: count, Length: 180, dtype: int64

**What do you observe?  What does this tell you about the granularity?**

The granularity of this dataset is a specific presidential candidate in a specific election year.


In [18]:
elections[["Candidate","Year"]].value_counts()

Candidate         Year
Aaron S. Watkins  1920    1
Lenora Fulani     1988    1
Lewis Cass        1848    1
Lyndon Johnson    1964    1
Martin Van Buren  1836    1
                         ..
George W. Bush    2004    1
George Wallace    1968    1
Gerald Ford       1976    1
Grover Cleveland  1884    1
Zachary Taylor    1848    1
Name: count, Length: 182, dtype: int64
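
Scanning the counts by eye works for a small table. As an alternative sketch, `.duplicated()` can answer the uniqueness question directly. The toy rows below mimic the 1824 situation (one party, two candidates), and the candidate names are placeholders:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Party": ["Democratic-Republican", "Democratic-Republican", "Democratic"],
    "Candidate": ["Candidate A", "Candidate B", "Candidate C"],  # placeholder names
})

# A set of columns uniquely identifies records if no combination repeats.
year_party_unique = not toy.duplicated(subset=["Year", "Party"]).any()
year_candidate_unique = not toy.duplicated(subset=["Year", "Candidate"]).any()

# Equivalent check with value_counts: the largest count should be 1.
largest_year_party_count = toy[["Year", "Party"]].value_counts().max()

print(year_party_unique)         # False
print(year_candidate_unique)     # True
print(largest_year_party_count)  # 2
```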

Fun facts:

https://en.wikipedia.org/wiki/1824_United_States_presidential_election

https://en.wikipedia.org/wiki/1836_United_States_presidential_election

In [None]:
...

### Summary:  Why do we care about granularity?

Understanding the shape and granularity of a table gives us insight into what a row in a data table represents. This helps us determine whether the granularity is mixed, aggregation is needed, or weights are required. 

After looking at the granularity of your dataset, you should have answers to the following questions:
 - What does a record represent? Clarity on this will help you correctly carry out a data analysis and state your findings.

 - Do all records in a table capture granularity at the same level? Sometimes a table contains additional summary rows that have a different granularity, and you want to use only those rows that are at the right level of detail.
 - If the data are aggregated, how was the aggregation performed? Summing and averaging are common types of aggregation. With averaged data, the variability in the measurements is typically reduced and relationships often appear stronger.
 - What kinds of aggregations might you perform on the data? Aggregations might be useful or necessary to combine one data table with another.
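
The point about averaging can be seen on a few made-up numbers. In this sketch, the group labels and values are purely illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "value": [10, 30, 18, 26],
})

# Spread of the individual measurements.
raw_std = toy["value"].std()

# Spread after averaging within each group (means are 20 and 22).
group_means = toy.groupby("group")["value"].mean()
means_std = group_means.std()

print(raw_std)
print(means_std)
```

The group means vary far less than the raw values, so relationships computed on averaged data can look tighter than they are at the level of individual records.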

Knowing your table’s granularity is a first step to cleaning your data, and it informs you of how to analyze the data. 

***
## EDA 3-5:  Scope, Temporality and Faithfulness


### Scope -- what is the scope of the data?  (How (in)complete is the data?)

Scope includes considering the target population we want to study, how to access information about that population, and what your given datasets are actually measuring.

**Guiding Questions to Consider**
 - Does the data cover the target population?  
 - Will we need to filter the data before using it? (Is it too expansive?)
 - Do we need to gather additional data before proceeding?
 

### Temporality -- how is the data situated in time?

**Guiding Questions To Consider**:

 - When was the data collected/last updated?
 - What is the meaning of any time and date fields? 
    - For our particular dataset, the `Year` field is the year in which the presidential election was held.
 - Are there strange date null values (e.g., January 1st 1970, January 1st 1900, etc.)?
 - Is there periodicity? Diurnal (24-hr), Monthly or Yearly patterns? 
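
Placeholder dates such as January 1st 1970 often stand in for missing values. Here is a sketch of how to flag them, using a made-up date column (the elections data has only a `Year` field, so this applies to datasets with full dates):

```python
import pandas as pd

# Made-up dates; two of them are common "null" placeholders.
dates = pd.to_datetime(
    pd.Series(["2020-11-03", "1970-01-01", "2016-11-08", "1900-01-01"])
)

# Dates that frequently signal a missing value rather than a real event.
placeholder_dates = [pd.Timestamp("1970-01-01"), pd.Timestamp("1900-01-01")]

suspicious = dates.isin(placeholder_dates)

print(dates[suspicious])
```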


### Faithfulness -- how well does the data capture “reality”?

**Guiding Questions To Consider**:

 - Does the data contain unrealistic or “incorrect” values?
 - Is there any missing data?
 - Does my data violate obvious dependencies?
 - Are there obvious signs of data falsification?
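
Several of these questions translate into simple checks. The sketch below uses a toy table with deliberately planted problems; the specific rules (percentages between 0 and 100, exactly one winner per year) are assumptions about what "realistic" means for election data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2020, 2016, 2016],
    "Result": ["win", "loss", "win", "win"],  # planted error: two winners in 2016
    "%": [51.3, 46.9, None, 120.0],           # planted errors: missing and > 100
})

# Missing data: count the null cells in each column.
missing_per_column = toy.isna().sum()

# Unrealistic values: a vote percentage must lie between 0 and 100.
out_of_range = toy[(toy["%"] < 0) | (toy["%"] > 100)]

# Dependency check: assume each election year has exactly one winner.
winners_per_year = toy[toy["Result"] == "win"].groupby("Year").size()
bad_years = winners_per_year[winners_per_year != 1]

print(missing_per_column)
print(out_of_range)
print(bad_years)
```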

#### Useful DataFrame Utility Functions


`df.shape`

`df.describe()`

`df.isna()`

#### Useful Series Utility Functions

`series.unique()`

**Practice:  Practice with each of the utility functions above and/or read their pandas documentation.  Then explain what each one does.**

In [None]:
# Practice with the utility functions above.   Then explain what each method does.

...

...

...



**Practice: Using the applicable utility function(s), determine how many unique years of election data we have in this dataset, and when it begins and ends.**

In [19]:
# Hint: We can select ONE column from a DataFrame as a Series using the following:
elections["Party"]

0                 Democratic
1                 Republican
2                Libertarian
3                      Green
4               Constitution
               ...          
177             Anti-Masonic
178               Democratic
179      National Republican
180    Democratic-Republican
181    Democratic-Republican
Name: Party, Length: 182, dtype: object

**Practice:  Are there any missing or unexpected data values?  Explain.**

In [20]:
elections.describe()

Unnamed: 0,Year,Popular vote,%
count,182.0,182.0,182.0
mean,1934.087912,12353640.0,27.47035
std,57.048908,19077150.0,22.968034
min,1824.0,100715.0,0.098088
25%,1889.0,387639.5,1.219996
50%,1936.0,1709375.0,37.677893
75%,1988.0,18977750.0,48.354977
max,2020.0,81268920.0,61.344703


In [21]:
elections["Year"].unique()

array([2020, 2016, 2012, 2008, 2004, 2000, 1996, 1992, 1988, 1984, 1980,
       1976, 1972, 1968, 1964, 1960, 1956, 1952, 1948, 1944, 1940, 1936,
       1932, 1928, 1924, 1920, 1916, 1912, 1908, 1904, 1900, 1896, 1892,
       1888, 1884, 1880, 1876, 1872, 1868, 1864, 1860, 1856, 1852, 1848,
       1844, 1840, 1836, 1832, 1828, 1824])

In [23]:
(2020-1824)/4+1

50.0

In [24]:
elections["Year"].nunique()

50
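
The arithmetic `(2020-1824)/4+1` gives the number of years we would expect if an election appears every four years with none skipped, and it matches `nunique()` here. A sketch of how to find which years are absent when the two numbers disagree, using a toy series with one year deliberately left out:

```python
import pandas as pd

# Toy data: 2008 is deliberately missing; 2020 appears twice.
years = pd.Series([2020, 2020, 2016, 2012, 2004])

first_year = int(years.min())
last_year = int(years.max())

# Every fourth year from the first to the last, inclusive.
expected_years = set(range(first_year, last_year + 1, 4))

# Years we expected to see but did not.
missing_years = sorted(expected_years - set(years.unique()))

print(len(expected_years))  # 5
print(years.nunique())      # 4
print(missing_years)        # [2008]
```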