# Section D - Pandas

**Topics:** Pandas basics, including row and column selection, the index, column names, data types and type-casting, and a bit more.

The name "Pandas" comes from "Panel Data" and "Python Data Analysis". "Panel Data" refers to a particular type of data that is multidimensional, involving measurements over time. The term "Pandas" is a blend of these concepts, reflecting the library's purpose of providing data structures and data analysis tools in Python.

**Pandas** are playful and memorable, just like **Pandas**!

Pandas has two main types of objects, DataFrames and Series.  A DataFrame is two-dimensional: it has rows and columns, like a spreadsheet.  A Series is one-dimensional.  If we select a single column (or a single row) from a DataFrame, we get a Series, and a Series can be inserted into a DataFrame as a column.

By convention, we'll import pandas as "pd" to save us some typing.

    import pandas as pd

It's also common to call a single dataframe we're working on "df", but it's a good idea to use a longer, more descriptive name for complex tasks.
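
To make the DataFrame and Series relationship concrete, here is a minimal sketch. The column names and values are made up for illustration, and "example_df" is just a placeholder name.

```python
import pandas as pd

# A small illustrative DataFrame; the data is made up.
example_df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Selecting one column returns a one-dimensional Series.
ages = example_df['Age']
print(type(ages).__name__)

# A Series can be assigned back into the DataFrame as a new column.
example_df['Age_next_year'] = ages + 1
print(example_df)
```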

Functionality is built into pd itself, as well as into the dataframe and series objects we create, and we use both to manipulate our data.  For example, we use these DataFrame methods a lot to view our data:

    df.info()  # show a summary of columns and data types in the dataframe
    df.head()  # show the top few rows of the dataframe
    df.tail()  # show the bottom few rows of the dataframe
    # ...and more

And there are functions we call from pd to manipulate the dataframes:

    new_df = pd.concat(a_list_of_dataframes)  # concatenate dataframes together
    # ...and more

## Creating a Dataframe
We can create an empty dataframe:

    df = pd.DataFrame()

But generally we'll want to load some data to make a dataframe. Common ways to do this follow. Refer to the documentation for optional arguments, like "skiprows" to skip padding rows at the top of an Excel or CSV file, or "usecols" to import only specific columns.

**Excel Files** - https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

    df = pd.read_excel(file_name, ... engine ...)

**CSV Files or dat Files** - https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You may need to set the delimiter (the "sep" argument) for some csv files.

    df = pd.read_csv(file_name, ...)
    df = pd.read_table(file_name, ...)
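
As a hedged sketch of how the sep, skiprows, and usecols arguments fit together: the file contents below are made up, and io.StringIO stands in for a real file name so the example runs without any file on disk.

```python
import io
import pandas as pd

# Made-up file contents: one padding line, then semicolon-delimited data.
raw_text = "exported from instrument\nsite;temp;notes\nA;20.5;ok\nB;21.0;check\n"

df_csv = pd.read_csv(
    io.StringIO(raw_text),      # stands in for a file name
    sep=';',                    # delimiter used in this file
    skiprows=1,                 # skip the padding line at the top
    usecols=['site', 'temp'],   # only import these columns
)
print(df_csv)
```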

**json Data** - https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
Useful for data loaded from the web.  This is what we use in the D1-Pandas_Example notebook.

    df = pd.read_json(json_data, ...)

**Dictionary of Lists to DataFrame**

In [5]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


**List of Dictionaries to DataFrame**
Same idea as above, but here each dictionary is one row rather than one column.

In [13]:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


#### *Exercise*

In the following code cell, use .info(), .describe(), and .head() to see what kind of data has been loaded into the dataframe.  

*We'll use this "df" for a few exercises below, so make sure to run this cell before continuing.*

In [15]:
import requests
import pandas as pd

# World Bank API: total US population (indicator SP.POP.TOTL), returned as JSON
data_url = 'https://api.worldbank.org/v2/countries/USA/indicators/SP.POP.TOTL?per_page=5000&format=json'
response = requests.get(data_url)

# The parsed response is a list; item [1] holds the yearly records
population = response.json()[1]
print(data_url)

# Each record is a dictionary, so the list converts directly to a DataFrame
df = pd.DataFrame(population)

df.head()

https://api.worldbank.org/v2/countries/USA/indicators/SP.POP.TOTL?per_page=5000&format=json


|   | indicator | country | countryiso3code | date | value | unit | obs_status | decimal |
|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'US', 'value': 'United States'} | USA | 2023 | 334914895 |  |  | 0 |
| 1 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'US', 'value': 'United States'} | USA | 2022 | 333271411 |  |  | 0 |
| 2 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'US', 'value': 'United States'} | USA | 2021 | 332048977 |  |  | 0 |
| 3 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'US', 'value': 'United States'} | USA | 2020 | 331526933 |  |  | 0 |
| 4 | {'id': 'SP.POP.TOTL', 'value': 'Population, to... | {'id': 'US', 'value': 'United States'} | USA | 2019 | 328329953 |  |  | 0 |



# How to approach learning pandas
Start with simple problems.  Import a clean excel file. 

## Basics
We use pandas to make **DataFrame**s and **Series**.  A single column of a dataframe is a series, and it has some different built-in functionality than a dataframe.

## Pandas Objects - DataFrames and Series
The DataFrame is the primary pandas object we will work with.

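## Note Regarding inplace=True
Most pandas methods return a new, modified object rather than changing the original, so the usual pattern is to assign the result to a name:

    changed_dataframe = df.some_modification()

Pandas is moving away from in-place modification.  It can still be done with many methods by passing inplace=True, in which case the original object is changed and the method returns None.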
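
A small sketch of the two styles, using sort_values as one example of a method that accepts inplace=True. The data and the name "scores_df" are made up for illustration.

```python
import pandas as pd

scores_df = pd.DataFrame({'b': [3, 1, 2]})

# Preferred: the method returns a new DataFrame and we assign it to a name.
sorted_df = scores_df.sort_values('b')

# Still accepted by many methods: modify the object in place.
# In that case the method itself returns None.
result = scores_df.sort_values('b', inplace=True)
print(result)
print(scores_df['b'].tolist())
```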

## Selecting Columns
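Single brackets with one column name return a Series; a list of column names inside the brackets returns a DataFrame:

    a_series = df['some_col']
    a_dataframe = df[['a_col', 'another_col']]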
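
The same selections as a runnable sketch. The dataframe "cols_df" and its values are made up so that the column names above exist.

```python
import pandas as pd

cols_df = pd.DataFrame({'a_col': [1, 2], 'another_col': [3, 4], 'some_col': ['x', 'y']})

a_series = cols_df['some_col']                    # one name -> Series
a_dataframe = cols_df[['a_col', 'another_col']]   # list of names -> DataFrame
one_col_df = cols_df[['some_col']]                # a one-item list still gives a DataFrame

print(type(a_series).__name__, type(a_dataframe).__name__, type(one_col_df).__name__)
```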

## Selecting Rows
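
Rows are commonly selected by index label with .loc, by position with .iloc, or with a boolean mask. A minimal sketch with made-up data (the name "rows_df" and the index labels are placeholders):

```python
import pandas as pd

rows_df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c'],
)

row_by_label = rows_df.loc['b']               # one row by index label -> Series
rows_by_position = rows_df.iloc[0:2]          # first two rows by position -> DataFrame
older_than_28 = rows_df[rows_df['Age'] > 28]  # boolean mask -> matching rows

print(older_than_28)
```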

## Iterating over rows
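
Two common ways to loop over rows are itertuples and iterrows. This is only a sketch with made-up data; for large dataframes, column-wise operations are usually much faster than a Python loop.

```python
import pandas as pd

loop_df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# itertuples yields one namedtuple per row; columns become attributes.
for row in loop_df.itertuples(index=False):
    print(row.Name, row.Age)

# iterrows yields (index label, row as a Series) pairs.
labels = [f"{idx}: {row['Name']}" for idx, row in loop_df.iterrows()]
print(labels)
```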

## Type Conversions
The most frequent conversions are string to numeric and string to datetime.
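
A hedged sketch of string-to-numeric conversion, assuming a made-up dataframe where numbers were read in as text and one value cannot be parsed:

```python
import pandas as pd

types_df = pd.DataFrame({'value': ['10', '20', 'n/a'], 'year': ['2021', '2022', '2023']})

# astype works when every value in the column can be converted.
types_df['year'] = types_df['year'].astype(int)

# to_numeric with errors='coerce' turns unparseable strings into NaN.
types_df['value'] = pd.to_numeric(types_df['value'], errors='coerce')

print(types_df.dtypes)
```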

### Datetime Conversions
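
A minimal sketch of string-to-datetime conversion with pd.to_datetime. The date strings and the format are assumptions chosen for illustration.

```python
import pandas as pd

date_strings = pd.Series(['2023-01-15', '2023-02-20'])

# Passing an explicit format avoids ambiguity about day/month order.
parsed = pd.to_datetime(date_strings, format='%Y-%m-%d')

# Once converted, the .dt accessor exposes the date parts.
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```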

## String Operations
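
String methods are available on a Series of text through the .str accessor. A small sketch with made-up names:

```python
import pandas as pd

names = pd.Series(['  alice smith', 'BOB JONES '])

# Methods chain: strip whitespace, then capitalize each word.
cleaned = names.str.strip().str.title()
print(cleaned.tolist())

first_names = cleaned.str.split(' ').str[0]   # first piece of each split string
contains_smith = cleaned.str.contains('Smith')  # boolean Series
```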


## Concatenation
When we read in multiple files, we can concatenate them into a single dataframe.  
It is often useful to add an identifier column recording which file each row came from, and to pull the date from the file name.
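
One way this could look, as a sketch under stated assumptions: the file names, their date pattern, and their contents are all hypothetical, and io.StringIO stands in for real files so the example is self-contained.

```python
import io
import re
import pandas as pd

# Hypothetical file names and contents; io.StringIO stands in for real files.
files = {
    'sales_2024-01-31.csv': "item,qty\napple,3\npear,5\n",
    'sales_2024-02-29.csv': "item,qty\napple,4\n",
}

frames = []
for file_name, text in files.items():
    part = pd.read_csv(io.StringIO(text))
    part['source_file'] = file_name  # identifier column
    # Pull the YYYY-MM-DD date out of the file name.
    date_text = re.search(r'\d{4}-\d{2}-\d{2}', file_name).group()
    part['file_date'] = pd.to_datetime(date_text)
    frames.append(part)

# ignore_index=True renumbers the rows of the combined dataframe from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```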

## Join Operations
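
A join combines two dataframes on a shared column, much like a SQL join. A minimal sketch using merge, with made-up tables:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})
states = pd.DataFrame({'City': ['New York', 'Chicago'], 'State': ['NY', 'IL']})

# how='left' keeps every row of people; rows with no match get NaN for State.
joined = people.merge(states, on='City', how='left')
print(joined)
```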

## Stack and Unstack (sort of like a pivot table)
**Stack** - This function pivots the columns of a DataFrame into its index, effectively "stacking" the data vertically. It converts a DataFrame from a wide format to a long format.

**Unstack** - This is the reverse of stack. It pivots the index of a DataFrame back into columns, converting it from a long format to a wide format.

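Why do this?  Long format (one value per row) is usually easier to filter, group, and plot, while wide format is easier to read as a table.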
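
A small worked example may help here. The numbers are made up; "wide" has one row per city and one column per year.

```python
import pandas as pd

# Wide format: one row per city, one column per year (made-up numbers).
wide = pd.DataFrame(
    {'2022': [10, 20], '2023': [11, 21]},
    index=['Boston', 'Chicago'],
)

# stack: the column labels move into the index, giving a long Series
# with a two-level (city, year) index and one value per row.
long_format = wide.stack()
print(long_format)

# unstack: the inner index level (year) moves back out to columns.
wide_again = long_format.unstack()
print(wide_again)
```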

## Plotting

## Exporting files
### Plain Excel
### Multiple Sheets Excel
### Other

# Creating a Dataframe
Just skim over this for the general idea on how it works, and come back to each method for importing data as you need it. 

## Empty Dataframe
Why would we want an empty dataframe?  I think it's generally not needed... but maybe there's a good case for starting with an empty df... 

    df = pd.DataFrame()

## From a CSV file

    df = pd.read_csv('data.csv')

## From an Excel file
The sheet name is only needed if the .xlsx has multiple sheets and we don't want the first one, which is read by default.

    df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

## From a list of lists or tuples
We should specify the column names in this case; otherwise the columns are simply numbered starting from 0:

    data = [[1, 2], [3, 4], [5, 6]]
    df = pd.DataFrame(data, columns=['A', 'B'])

## From a dictionary 
The dictionary keys are the **column** names, and each list is a column of data.

    data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
    df = pd.DataFrame(data)

## From a database
Note that a database connection, called "conn" here, is a pretty standard thing.  You can create a connection to many database types, pass the connection and a query to pd.read_sql_query, and get a dataframe back (pandas officially supports SQLAlchemy connections and sqlite3 connections).  Sqlite3 is a file-based database that doesn't require a server to host it.

    import sqlite3

    conn = sqlite3.connect('database.db')
    df = pd.read_sql_query('SELECT * FROM table_name', conn)
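
To try this without an existing database file, here is a sketch that builds a temporary in-memory SQLite database first. The table contents and the WHERE clause are made up for illustration.

```python
import sqlite3
import pandas as pd

# ':memory:' creates a temporary database that lives only in RAM.
conn = sqlite3.connect(':memory:')

# Write a small made-up table so there is something to query.
scores = pd.DataFrame({'id': [1, 2, 3], 'score': [90, 75, 60]})
scores.to_sql('table_name', conn, index=False)

# Any SQL the database understands can go in the query string.
df_sql = pd.read_sql_query('SELECT id, score FROM table_name WHERE score > 70', conn)
conn.close()
print(df_sql)
```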

## From an html table
Note that you can also generate html tables from dataframes... 
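
A minimal sketch of the generating direction, using to_html on a made-up dataframe. Reading tables from html is done with pd.read_html, which needs an html parser library such as lxml installed, so it is not shown here.

```python
import pandas as pd

table_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# to_html returns the table markup as a string; index=False leaves out the row labels.
html_text = table_df.to_html(index=False)
print(html_text)
```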


