# Section D - Pandas

**Topics:** Pandas basics, including row and column selections, index, column names, data types and type-casting, and a bit more. 

The name "Pandas" comes from "Panel Data" and "Python Data Analysis". "Panel Data" is an econometrics term for data sets that include observations of the same things over multiple time periods - time series - or collections of things/events. The term "Pandas" is a blend of these concepts, reflecting the library's purpose of providing data structures and data analysis tools in Python.

**Pandas** are playful and memorable, just like **Pandas**!

Pandas has two main types of objects, **DataFrames** and **Series**.  A DataFrame has rows and columns, like a spreadsheet - two dimensional.  A single row or column from a DataFrame is a Series.  If we select a single column from a DataFrame, we get a Series, a one dimensional object, and a Series can be inserted into a DataFrame column. 
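As a small sketch of the DataFrame/Series relationship described above (the `heights_df` name and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data, just for illustration.
heights_df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cy'],
                           'height_cm': [150, 160, 170]})

# Selecting one column gives a Series.
heights = heights_df['height_cm']
print(type(heights_df).__name__)
print(type(heights).__name__)

# A Series can be assigned back into the DataFrame as a new column.
heights_df['height_m'] = heights / 100
print(heights_df)
```

The two `type(...)` prints should show that the whole table is a DataFrame while the single column is a Series.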

By convention, we'll import pandas as "pd" to save us some typing.

    import pandas as pd

It's also common to call a single dataframe that we're working on "df", but it's a good idea to use a longer, more descriptive name for complex tasks.

    df = pd.read_csv('my_data.csv')

Functionality is built into both `pd` itself and the dataframe and series objects that we create, and we use it to manipulate our data.  For example, we use these DataFrame methods a lot to view our data:

    df.info()  # show a summary of columns and data types in the dataframe. 
    df.head()  # show the top few rows of the dataframe.
    df.tail()  # few bottom rows
    df.describe()  # summary statistics for the numeric columns
    ...and more

And there are functions we call from pd to manipulate the dataframes:

    big_df = pd.concat(a_list_of_small_dataframes)  # concatenate dataframes together
    ...and more

## Creating a Dataframe
We can create an empty dataframe:

    df = pd.DataFrame()

But generally (or always) we'll want to load some data to make a dataframe. Common ways to do this follow. Reference the documentation to see optional arguments to use, like `skiprows` to skip padding rows at the top of an excel or csv file, or `usecols` to only import specific columns. 
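Here is a rough sketch of those two optional arguments with `pd.read_csv`. The file contents are made up, and `io.StringIO` is only used so the example runs without a real file on disk:

```python
import io
import pandas as pd

# Hypothetical file contents: two padding lines, then a header and data.
raw_text = """Exported by some tool
(padding line)
name,age,city
Alice,25,New York
Bob,30,Chicago
"""

# io.StringIO lets read_csv treat the string like a file.
# skiprows=2 skips the two padding lines; usecols keeps only two columns.
people = pd.read_csv(io.StringIO(raw_text), skiprows=2, usecols=['name', 'age'])
print(people)
```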

**Excel Files** - https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

    df = pd.read_excel(file_name, ... engine ...)

**CSV Files or dat Files** - https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You may need to set the delimiter (the `sep` argument) for some csv files. 

    df = pd.read_csv(file_name, ...)
    df = pd.read_table(file_name, ...)
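For instance, a file that uses semicolons instead of commas could be read like this (a minimal sketch with made-up contents, again using `io.StringIO` in place of a real file):

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited contents.
semicolon_text = "name;age\nAlice;25\nBob;30\n"

# sep tells read_csv which character separates the columns.
semi_df = pd.read_csv(io.StringIO(semicolon_text), sep=';')
print(semi_df)
```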

**json Data** - https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
Useful for data loaded from the web.  This is what we use in the D1-Pandas_Example notebook.

    df = pd.read_json(json_data, ...)

**Dictionary of Lists to DataFrame**

In [3]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


**List of Dictionaries to DataFrame**

Same idea as above, but slightly different format.

In [13]:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


#### *Exercise*

In the following code cell, use these functions to look at information about the dataframe:

    .info(), .describe(), and .head() 

And print the following properties of the dataframe, like: `df.shape`

    .columns, .size, .shape

* What data type is each of the columns?
* How many rows and columns are there?
* What's the relationship between shape and size?
* Use a list comprehension to overwrite df.columns and make the column names upper case.  `df.columns = [... ... df.columns]`

Scroll through the DataFrame documentation to get an idea of what methods are built into it: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

In [4]:
# *We'll use this "df" for a few exercises below, so make sure to run this cell before continuing.*
df = pd.read_csv("https://raw.githubusercontent.com/a8ksh4/python_workshop/main/SAMPLE_DATA/iris.csv")
# You can also try saving iris.csv in the directory with your notebook and opening it from a local path.

In [22]:
# Your code here.  You can re-run the above cell if you mess up your dataframe.
# print(df....)
df.head()

   sepal_length  sepal_width  petal_length  petal_width      species  sepal_length_inches
0           5.1          3.5           1.4          0.2  Iris-setosa             2.007875
1           4.9          3.0           1.4          0.2  Iris-setosa             1.929135
2           4.7          3.2           1.3          0.2  Iris-setosa             1.850395
3           4.6          3.1           1.5          0.2  Iris-setosa             1.811025
4           5.0          3.6           1.4          0.2  Iris-setosa             1.968505


## Selecting Columns by name:
We can select a single column by passing its name in brackets, like: `df['column_name']`

And we can select multiple columns by passing a list of column names in nested brackets: `df[['column1', 'column2', ...]]`

This is a bit like string or list slicing, but using names or lists of names to take a selection of the available columns.

We can use this to both get values from columns or to assign values directly into one or more columns, or to create new columns of some name.

In [None]:
# a single column is a series object, so sls is a series.
sls = df['sepal_length']
print('Some of the sepal lengths are:\n', sls)
print('All the lengths are:\n', list(sls))

#### *Exercise*
Just like we did for the dataframe above, let's explore this "sls" series object.

* Use the `.info()` method and the `.shape` and `.size` properties to learn about the object. 
* Let's try some more interesting methods built into series objects: `.sum(), .value_counts(), .mean()`
* Check if the series is greater than 3 (`sls > 3`).  What is returned?  This series of True/False values is important for a future concept, "masks", for selecting rows.
* Scroll through some of the methods listed in the series documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

### Creating and manipulating columns of data:
We can perform mathematical operations on columns of data and put the result into a new column or overwrite an existing one.  For example, if we want to add a column with units of inches instead of cm:

In [None]:
df['sepal_length_inches'] = df['sepal_length'] * 0.393701

length_columns = sorted([c for c in df.columns if 'length' in c])
print('length comparison:\n', df[length_columns])

When you perform operations on a column, like multiplying the 'sepal_length' column by 0.393701, that operation is broadcast across all rows in the column.  

And when we perform an operation against two columns, each row in one column is matched with the row that has the same index in the other column, as with the width_difference calculation below.
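To see the difference between broadcasting a single number and matching rows by index, here is a minimal sketch with two made-up series (`a` and `b` are illustrative names) whose index labels are the same but in a different order:

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=[0, 1, 2])
b = pd.Series([1, 2, 3], index=[2, 1, 0])   # same labels, reversed order

# A single number is broadcast to every row.
doubled = a * 2
print(doubled)

# Two series are matched up by index label, not by position:
# label 0 -> 10 - 3, label 1 -> 20 - 2, label 2 -> 30 - 1
difference = a - b
print(difference)
```

Columns taken from the same dataframe share one index, so this matching usually goes unnoticed, but it matters when combining series from different sources.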

We can also select multiple columns by passing a list of column names in [], like: `df[['petal_length', 'petal_width']]`

In [None]:
df['width_difference'] = (df['sepal_width'] - df['petal_width']).abs()

# Alternate ways of selecting and printing columns are commented out below:

# width_columns = df.columns[df.columns.str.contains('width')]
# width_columns = ['sepal_width', 'petal_width', 'width_difference']
width_columns = sorted([c for c in df.columns if 'width' in c])

print('Widths:')
# print(df[['sepal_width', 'petal_width', 'width_difference']])
print(df[width_columns])

## Selecting Rows with loc and iloc
**.loc** vs **.iloc**
* .loc selects rows with particular labels in the series or dataframe index.
  * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* .iloc selects rows at integer locations within the series or dataframe.
  * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

In [51]:
# generate a dataframe with ten people with made-up names, ages, social security numbers, and cities:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 
             'Grace', 'Hannah', 'Isaac', 'Jack'],
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'SSN': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012', 
            '567-89-0123', '678-90-1234', '789-01-2345', '890-12-3456', 
            '901-23-4567', '123-45-5789'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 
             'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
})
print(df)

      Name  Age          SSN          City
0    Alice   25  123-45-6789      New York
1      Bob   30  234-56-7890   Los Angeles
2  Charlie   35  345-67-8901       Chicago
3    David   40  456-78-9012       Houston
4      Eve   45  567-89-0123       Phoenix
5    Frank   50  678-90-1234  Philadelphia
6    Grace   55  789-01-2345   San Antonio
7   Hannah   60  890-12-3456     San Diego
8    Isaac   65  901-23-4567        Dallas
9     Jack   70  123-45-5789      San Jose


In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/a8ksh4/python_workshop/main/SAMPLE_DATA/titaninc.csv")
# Note that by default, an arbitrary numerical index is assigned to the rows.
# That default index would match exactly with the numeric address of each row, 
# so it is not useful for this example. 
# We instead set the passenger ID as the index - loc refers to this, and iloc 
# refers to the literal numerical address of each row. 
df = df.set_index('PassengerId')

In [40]:
df.head(4)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,PassId
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,4


So let's print the row at integer position 2 (the third row) using iloc, and the passenger with PassengerId 2 using loc:

In [44]:
print('loc:\n', df.loc[2]['Name'])
print('iloc:\n', df.iloc[2]['Name'])

loc:
 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
iloc:
 Heikkinen, Miss. Laina


### loc selection of rows and columns
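loc selects by index label rather than by position. A minimal sketch, using a small made-up dataframe (the `people` name, its values, and the 101-104 index labels are assumptions for illustration, chosen so that labels differ from positions):

```python
import pandas as pd

# Hypothetical data with a non-default index.
people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
     'Age': [25, 30, 35, 40],
     'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']},
    index=[101, 102, 103, 104],
)

one_row = people.loc[102]                        # one row, by label
some_rows = people.loc[[101, 104]]               # several rows, by list of labels
block = people.loc[102:104, ['Name', 'City']]    # label slice and a list of columns
one_value = people.loc[103, 'Age']               # a single value

print(block)
```

Note that a label slice with loc includes its end label, unlike a positional slice with iloc.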


### iloc selection of rows and columns
iloc can select multiple rows and columns by their integer positions, and slices of rows and columns:

* **Multiple rows**: `df.iloc[[2,3,4]]`
* **Multiple rows and cols**: `df.iloc[[2,3], [0,1,2]]`
* **Slice of rows**:
  * `df.iloc[2:5]`
  * `df.iloc[:5]`
* **slice of rows and cols**: `df.iloc[1:3, 1:4]`
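The forms listed above can be tried on a small made-up dataframe. In this sketch the `letters` name, its values, and its letter index are assumptions, chosen so that positions are easy to tell apart from labels:

```python
import pandas as pd

# Hypothetical data: 6 rows labelled 'u'..'z' and 4 columns 'a'..'d'.
letters = pd.DataFrame(
    {'a': list(range(6)),
     'b': list(range(10, 16)),
     'c': list(range(20, 26)),
     'd': list(range(30, 36))},
    index=list('uvwxyz'),
)

rows = letters.iloc[[2, 3, 4]]                  # rows at positions 2, 3, 4
rows_and_cols = letters.iloc[[2, 3], [0, 1, 2]] # positions 2-3, first three columns
row_slice = letters.iloc[2:5]                   # same rows as the list above
first_five = letters.iloc[:5]                   # positions 0 through 4
corner = letters.iloc[1:3, 1:4]                 # rows 1-2, columns 1-3

print(corner)
```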

A simple example of use of this might be if I wanted to split my data into a training set and a testing set for some machine learning prediction algorithm.  I would randomize the order of the rows, then select 70% of the rows for training and the remaining 30% for testing:

In [45]:
# Randomize the order of the rows
df_randomized = df.sample(frac=1)

# figure out how many rows we need:
training_size = int(len(df_randomized) * 0.7)
testing_size = len(df_randomized) - training_size

# split the data
df_training = df_randomized.iloc[:training_size]
df_testing = df_randomized.iloc[training_size:]

#### *Exercise*
Use iloc to show these views of the titanic passengers:
* The 4th through 6th passengers
* Even numbered passenger rows (not even PassengerId) and columns 1:4.

## Iterating Over rows

In [None]:
# iterrows() yields (index, row) pairs; each row is a Series.
for index, row in df.iterrows():
    print(index, row['Name'])


## Type Conversions
* **String to Numeric**
* **String to Datetime**
* **Datetime to Numeric**
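As a rough sketch of these three conversions, assuming a small made-up dataframe of text values (the `raw` name, its columns, and the choice of "seconds since 1970-01-01" as the numeric form of a date are all assumptions for illustration):

```python
import pandas as pd

# Hypothetical data where everything was read in as text.
raw = pd.DataFrame({
    'price': ['1.50', '2.25', 'n/a'],
    'day': ['2024-01-01', '2024-01-02', '2024-01-03'],
})

# String to numeric: errors='coerce' turns unparseable values into NaN.
raw['price'] = pd.to_numeric(raw['price'], errors='coerce')

# String to datetime.
raw['day'] = pd.to_datetime(raw['day'])

# Datetime to numeric: whole seconds since 1970-01-01.
raw['epoch_seconds'] = (raw['day'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(raw.dtypes)
print(raw)
```

Simple casts are also available through `.astype()`, for example `raw['price'].astype(str)` to go back to text.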

## String Operations

## Using .apply for arbitrary operations









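## Note Regarding inplace=True
Most dataframe methods return a changed copy and leave the original alone, so we assign the result to a name:

    changed_dataframe = df.some_modification()

Pandas is phasing out inplace modification.  It can still be done by passing `inplace=True` to many methods, in which case the original dataframe is changed and the method returns None.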
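A minimal sketch of the two styles, using `sort_values` on a made-up `scores` dataframe (the names and values are assumptions for illustration):

```python
import pandas as pd

scores = pd.DataFrame({'name': ['b', 'a', 'c'], 'score': [2, 3, 1]})

# Usual style: the method returns a new dataframe, so assign the result.
sorted_scores = scores.sort_values('score')

# inplace style: the original object is changed and the call returns None.
result = scores.sort_values('score', inplace=True)

print(sorted_scores)
print(result)
```

A common mistake is writing `df = df.some_method(inplace=True)`, which replaces the dataframe with None.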





## Concatenation
When we read in multiple files, we can concatenate them into a single dataframe.  
Example should show adding an identifier row and pulling date from file name.
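As a rough sketch of that idea, the example below reads two made-up "files", tags each row with the file it came from, pulls a date out of the file name, and concatenates the pieces. The file names, their `sales_<date>.csv` pattern, and the contents are all hypothetical, and `io.StringIO` stands in for real files on disk:

```python
import io
import pandas as pd

# Hypothetical file names and contents.
files = {
    'sales_2024-01-01.csv': "item,qty\napple,3\npear,1\n",
    'sales_2024-01-02.csv': "item,qty\napple,5\n",
}

frames = []
for file_name, contents in files.items():
    part = pd.read_csv(io.StringIO(contents))
    # Identifier column so we know where each row came from.
    part['source_file'] = file_name
    # Pull the date out of a name shaped like 'sales_<date>.csv'.
    date_text = file_name.removeprefix('sales_').removesuffix('.csv')
    part['date'] = pd.to_datetime(date_text)
    frames.append(part)

# ignore_index=True gives the combined dataframe a fresh 0..n-1 index.
big_df = pd.concat(frames, ignore_index=True)
print(big_df)
```

With real files, the loop would typically run over a list of paths and pass each path straight to `pd.read_csv`.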

## Join Operations

## Stack and Unstack (sort of like a pivot table)
**Stack** - This function pivots the columns of a DataFrame into its index, effectively "stacking" the data vertically. It converts a DataFrame from a wide format to a long format.

**Unstack** - This is the reverse of stack. It pivots the index of a DataFrame back into columns, converting it from a long format to a wide format.

What does this mean and why!!!???
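One way to get a feel for it is a tiny made-up example (the `wide` dataframe of test scores is an assumption for illustration). Wide format has one row per person and one column per subject; long format has one value per (person, subject) pair:

```python
import pandas as pd

# Hypothetical wide data: one row per person, one column per subject.
wide = pd.DataFrame(
    {'math': [90, 80], 'art': [70, 60]},
    index=['Alice', 'Bob'],
)

# stack: the column names move into the index, giving one value
# per (person, subject) pair.
long = wide.stack()
print(long)

# unstack: the inner index level moves back out into columns.
back = long.unstack()
print(back)
```

Long format tends to be convenient for grouping and plotting, while wide format is easier to read as a table.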

## Plotting

## Exporting files
### Plain Excel
### Multiple Sheets Excel
### Other

# Creating a Dataframe
Just skim over this for the general idea on how it works, and come back to each method for importing data as you need it. 

## Empty Dataframe
Why would we want an empty dataframe?  I think it's generally not needed... but maybe there's a good case for starting with an empty df... 

    df = pd.DataFrame()

## From a CSV file

    df = pd.read_csv('data.csv')

## From an Excel file
The sheet name is only needed if we want a sheet other than the first one in the .xlsx; by default the first sheet is read.

    df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

## From a list of lists or tuples
We need to specify the column names in this case:

    data = [[1, 2], [3, 4], [5, 6]]
    df = pd.DataFrame(data, columns=['A', 'B'])

## From a dictionary 
The dictionary keys are the **column** names, and each list is a column of data. 

    data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
    df = pd.DataFrame(data)

## From a database
Note that a database connection, called "conn" here, is a pretty standard thing.  You can create a connection to many database types and pass the connection and query to pd.read_sql_query and it will generally just work.  Sqlite3 is a file based database that doesn't require a server to host it. 

    import sqlite3

    conn = sqlite3.connect('database.db')
    df = pd.read_sql_query('SELECT * FROM table_name', conn)
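To try this without an existing database file, here is a minimal sketch that uses an in-memory SQLite database. The `people` table and its contents are made up, and `to_sql` is used only to put some data there to query:

```python
import sqlite3
import pandas as pd

# ':memory:' creates a temporary database that lives only in RAM.
conn = sqlite3.connect(':memory:')

# Write a small hypothetical table so there is something to query.
pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]}).to_sql(
    'people', conn, index=False
)

# Read the result of a query back into a dataframe.
older = pd.read_sql_query('SELECT name, age FROM people WHERE age > 26', conn)
conn.close()

print(older)
```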

## From an html table
Note that you can also generate html tables from dataframes... 
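A small sketch of the generating direction, using `DataFrame.to_html` on a made-up dataframe. Reading goes the other way with `pd.read_html`, which returns a list of dataframes and needs an HTML parser library such as lxml installed, so it is not run here:

```python
import pandas as pd

small = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# to_html returns the table as a string of html.
html = small.to_html(index=False)
print(html)
```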


