# Section D1 - Pandas Data Import and Selection
**Index**
* Creating a Dataframe
* Selecting Columns by Name
* Selecting Rows and Columns with loc and iloc
* Selecting Rows with a mask
* Selecting Rows with .query
* Using .iterrows

Pandas has two main types of objects, **DataFrames** and **Series**.  A DataFrame is two dimensional, with rows and columns, like a spreadsheet.  A Series is one dimensional: a single row or column from a DataFrame.  If we select a single column from a DataFrame, we get a Series, and a Series can be inserted into a DataFrame as a column.
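
As a minimal sketch of the DataFrame/Series relationship, the snippet below builds a small made-up dataframe (the column names `a`, `b`, and `a_doubled` are only for illustration), pulls out a column and a row, and assigns a Series back in as a new column.

```python
import pandas as pd

# A small, made-up dataframe for illustration.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

col = df['a']       # selecting one column gives a Series
row = df.iloc[0]    # selecting one row also gives a Series
print(type(df).__name__, type(col).__name__, type(row).__name__)

# A Series can be assigned back into the dataframe as a column.
df['a_doubled'] = col * 2
print(df)
```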

By convention, we'll import pandas as "pd" to save us some typing.

    import pandas as pd

It's also common to call the single dataframe that we're working on "df", but it's a good idea to use a longer, more descriptive name for complex tasks.

    df = pd.read_csv('my_data.csv')

Some functionality is built into pd itself, and some is built into the DataFrame and Series objects that we create.  For example, we use these DataFrame methods a lot to view our data:

    df.info()  # show a summary of columns and data types in the dataframe. 
    df.head()  # show the top few rows of the dataframe.
    df.tail()  # few bottom rows
    df.describe()  # summary statistics for the numeric columns
    ...and more

And there are functions we call from pd to manipulate the dataframes:

    big_df = pd.concat(a_list_of_small_dataframes)  # concatenate dataframes together
    ...and more

## Creating a Dataframe
We can create an empty dataframe:

    df = pd.DataFrame()

But generally (or always) we'll want to load some data to make a dataframe. Common ways to do this follow. Reference the documentation to see optional arguments, like "skiprows" to skip padding rows at the top of an Excel or csv file, or "usecols" to import only specific columns. 

**Excel Files** - https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

    df = pd.read_excel(file_name, ... engine ...)

**CSV Files or dat Files** - https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You may need to set the delimiter for some csv files. 

    df = pd.read_csv(file_name, ...)
    df = pd.read_table(file_name, ...)
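
A minimal sketch of the optional arguments in action, assuming some made-up semicolon-delimited text with two padding lines at the top. `io.StringIO` stands in for a real file here so the example runs on its own; a file path or URL would normally go in its place.

```python
import io
import pandas as pd

# Made-up file contents: two padding lines, then semicolon-delimited data.
raw_text = """exported by some tool
on some date
name;age;city
Alice;25;New York
Bob;30;Los Angeles
"""

df = pd.read_csv(
    io.StringIO(raw_text),    # a file path or URL also works here
    sep=';',                  # the delimiter used in this data
    skiprows=2,               # skip the two padding lines at the top
    usecols=['name', 'age'],  # only import these columns
)
print(df)
```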

**json Data** - https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
Useful for data loaded from the web.  This is what we use in the D1-Pandas_Example notebook.

    df = pd.read_json(json_data, ...)

**Dictionary of Lists to DataFrame**

In [3]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


**List of Dictionaries to DataFrame**

Same idea as above, but in a slightly different format.

In [13]:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


#### *Exercise*

In the following code cell, use these functions to look at information about the dataframe:

    .info(), .describe(), and .head() 

And print the following properties of the dataframe, like: `df.shape`

    .columns, .size, .shape

* What data type is each of the columns?
* How many rows and columns are there?
* What's the relationship between shape and size?
* Use a list comprehension to overwrite df.columns and make the column names upper case.  `df.columns = [... ... df.columns]`

Scroll through the DataFrame documentation to get an idea of what methods are built into it: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

In [4]:
# *We'll use this "df" for a few exercises below, so make sure to run this cell before continuing.*
df = pd.read_csv("https://raw.githubusercontent.com/a8ksh4/python_workshop/main/SAMPLE_DATA/iris.csv")
# You can also try saving iris.csv in the directory with your notebook and opening it from a local path.

In [22]:
# Your code here.  You can re-run the above cell if you mess up your dataframe.
# print(df....)
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_inches
0,5.1,3.5,1.4,0.2,Iris-setosa,2.007875
1,4.9,3.0,1.4,0.2,Iris-setosa,1.929135
2,4.7,3.2,1.3,0.2,Iris-setosa,1.850395
3,4.6,3.1,1.5,0.2,Iris-setosa,1.811025
4,5.0,3.6,1.4,0.2,Iris-setosa,1.968505


## Selecting Columns by Name
We can select a single column by passing its name in brackets, like: `df['column_name']`

And we can select multiple columns by passing a list of column names in nested brackets: `df[['column1', 'column2', ...]]`

This is a bit like string or list slicing, but using names or lists of names to take a selection of the available columns.

We can use this to get values from columns, to assign values directly into one or more columns, or to create new columns.
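
A minimal sketch of these three uses, assuming a tiny made-up dataframe with iris-style column names (the `source` column is hypothetical and only added for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'species': ['Iris-setosa'] * 3,
})

one_column = df['sepal_length']                    # a Series
two_columns = df[['sepal_length', 'sepal_width']]  # a DataFrame

df['source'] = 'example'                   # create a new column
df['species'] = df['species'].str.upper()  # overwrite an existing column
print(df)
```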

In [None]:
# A single column is a Series object, so sls is a Series.
sls = df['sepal_length']
print('Some of the sepal lengths are:\n', sls)
print('All the lengths are:\n', list(sls))

#### *Exercise*
Just like we did for the dataframe above, let's explore this "sls" series object.

* Use `.info()`, `.shape`, and `.size` to learn about the object. 
* Let's also try some more interesting methods built into Series objects: `.sum(), .value_counts(), .mean()`
* Check if the series is greater than 3.  What is returned?  This list of True/False values is important for a future concept, "masks", for selecting rows.
* Scroll through some of the methods listed in the series documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

### Creating and manipulating columns of data:
We can perform mathematical operations on columns of data and put the result into a new column or overwrite an existing one.  For example, if we want to add a column with units of inches instead of cm:

In [None]:
df['sepal_length_inches'] = df['sepal_length'] * 0.393701

length_columns = sorted([c for c in df.columns if 'length' in c])
print('length comparison:\n', df[length_columns])

When you perform an operation on a column, like multiplying the 'sepal_length' column by 0.393701, that operation is broadcast across all rows in the column.  

When we perform an operation on two columns, each row in one column is matched with the row that has the same index in the other column, as with the width_difference calculation below.

We can also select multiple columns by passing a list of column names in [], like: `df[['petal_length', 'petal_width']]`
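
To make the index matching concrete, here is a minimal sketch with two made-up Series whose labels are the same but in a different order. The subtraction pairs values by label, not by position.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
s2 = pd.Series([1, 2, 3], index=['z', 'y', 'x'])  # same labels, reversed order

doubled = s1 * 2      # the scalar is applied to every row
difference = s1 - s2  # rows are matched by index label, not by position

print(doubled)
print(difference)
```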

In [None]:
df['width_difference'] = (df['sepal_width'] - df['petal_width']).abs()

# Alternate ways of selecting and printing columns are commented out below:

# width_columns = df.columns[df.columns.str.contains('width')]
# width_columns = ['sepal_width', 'petal_width', 'width_difference']
width_columns = sorted([c for c in df.columns if 'width' in c])

print('Widths:')
# print(df[['sepal_width', 'petal_width', 'width_difference']])
print(df[width_columns])

## Selecting Rows with loc and iloc
**.loc** vs **.iloc**
* .loc selects rows with particular labels in the series or dataframe index
  * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* .iloc selects rows at integer locations within the series or dataframe.
  * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
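
One difference that is easy to miss is how slices behave. A minimal sketch, assuming a made-up dataframe with letter labels as its index: a label slice with .loc includes the end label, while a position slice with .iloc excludes the end position, like a regular list slice.

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc['b':'c']  # label slice: includes both 'b' and 'c'
by_position = df.iloc[1:2]  # position slice: only position 1, the end is excluded

print(by_label)
print(by_position)
```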

In [4]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 
             'Grace', 'Hannah', 'Isaac', 'Jack'],
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'SSN': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012', 
            '567-89-0123', '678-90-1234', '789-01-2345', '890-12-3456', 
            '901-23-4567', '123-45-5789'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 
             'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
})
df = df.set_index('SSN')
print(df.head())

                Name  Age         City
SSN                                   
123-45-6789    Alice   25     New York
234-56-7890      Bob   30  Los Angeles
345-67-8901  Charlie   35      Chicago
456-78-9012    David   40      Houston
567-89-0123      Eve   45      Phoenix


Since we set the index of our dataframe to the 'SSN' column, we can use loc to print rows with a specific SSN, or lists of SSNs:

In [78]:
print('A single row:\n', 
      df.loc['345-67-8901'])
print('A list of rows by SSN and a slice of columns from Age to City:\n', 
      df.loc[['345-67-8901','456-78-9012'], 'Age':'City'])
print('A range of rows by SSN:\n',
      df.loc['345-67-8901':'567-89-0123', 'City'])

A single row:
 Name      Charlie
Age            35
City    Saskatoon
Name: 345-67-8901, dtype: object
A list of rows by SSN:
              Age       City
SSN                        
345-67-8901   35  Saskatoon
456-78-9012   40    Houston
A range of rows by SSN:
 SSN
345-67-8901    Saskatoon
456-78-9012      Houston
567-89-0123      Phoenix
Name: City, dtype: object


And we can include a column name to print specific values or to set them:

In [None]:
some_ssn = '345-67-8901'
print(f'{some_ssn} lives in:', df.loc[some_ssn, 'City'])
df.loc[some_ssn, 'City'] = 'Saskatoon'
print('Or was it:', df.loc[some_ssn, 'City'])

#### *Exercise*:
A few people have moved.  Please update the City for each of them:
* People with SSNs '678-90-1234' and '789-01-2345'  didn't pay their taxes and are singing the blues in Folsom. 
* People with SSNs '890-12-3456', '901-23-4567', and '123-45-5789' are retiring and moved to Palm Beach.

How would you do each of these one at a time with a loop, or all at once in a single operation?  

### iloc selection of rows and columns
Rather than selecting by index label with loc, we can use iloc to select by integer position: a single position like 0, 1 or 2, a list of positions, [1, 2, 3], or a slice of positions, [2:6].  The same works for the columns returned.  A few examples:

In [72]:
print('Row 0:\n', 
      df.iloc[0])
print('\nRows 2 and 5 and Age column:\n', 
      df.iloc[[2,5], 1])
print('\nRows 2:6 and columns 0 and 1 using slices:\n', 
      df.iloc[2:7, :2])

Row 0:
 Name       Alice
Age           25
City    New York
Name: 123-45-6789, dtype: object

Rows 2 and 5 and Age column:
 SSN
345-67-8901    35
678-90-1234    50
Name: Age, dtype: int64

Rows 2:6 and columns 0 and 1 using slices:
                 Name  Age
SSN                      
345-67-8901  Charlie   35
456-78-9012    David   40
567-89-0123      Eve   45
678-90-1234    Frank   50
789-01-2345    Grace   55



Just like with loc, we can assign values to rows and columns selected using .iloc, and we can capture those selections in new dataframes as needed. 

Also notice that the SSN index is shown in the output above.  If you call .reset_index(), you'd see a new numerical index instead of the SSNs. 
We'll look more at the index below.
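
A minimal sketch of assigning through iloc and capturing a selection, assuming a small made-up dataframe (the index labels `s1` to `s3` are placeholders). The `.copy()` is a design choice so that later edits to the new dataframe do not touch the original.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['s1', 's2', 's3'],
)

df.iloc[0, 1] = 26  # row position 0, column position 1 (Age)

first_two = df.iloc[:2].copy()    # capture the first two rows in a new dataframe
first_two.iloc[0, 0] = 'Alicia'   # only changes the copy

print(df)
print(first_two)
```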

#### *Exercise*
Studies have shown that older people tend to be more fun than younger people. 
* Use iloc to create two new dataframes called 'top_five' and 'bottom_five' from the top and bottom five rows from 'df'.  
* Calculate the average age of each group and determine which group is likely to be the most fun!  You can compute the average of a column using .mean()... something like foo['col_name'].mean(). 

Do the cities that each group of people live in corroborate the results of the study, or is this silly?

## Using .iterrows() to iterate over rows
.iterrows() returns an iterator that we can pair with a for loop to look at each row one at a time.  This isn't in the spirit of pandas, which would prefer that we do something to all of the rows at the same time, but it can be very useful. 
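
To illustrate the difference between the two styles, here is a minimal sketch on a made-up dataframe that does the same update twice: once row by row with .iterrows(), and once on all matching rows at the same time using a True/False mask (masks are covered in a later section).

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Anna'], 'Age': [25, 30, 35]})

# Loop version: visit each row and update it with loc.
looped = df.copy()
for row_index, row_vals in looped.iterrows():
    if row_vals['Name'].startswith('A'):
        looped.loc[row_index, 'Name'] = row_vals['Name'] + ' was here'

# All-at-once version: build a mask, then update every matching row.
vectorized = df.copy()
starts_with_a = vectorized['Name'].str.startswith('A')
vectorized.loc[starts_with_a, 'Name'] = vectorized.loc[starts_with_a, 'Name'] + ' was here'

print(looped.equals(vectorized))
```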

In [94]:
for row_index, row_vals in df.iterrows():
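    # print out the name, city, and age of the person in this row:
    # print(row[1]['Name'], 'lives in', row[1]['City'], 'and is', row[1]['Age'], 'years old.')
    # the [1] is needed if we loop with a single variable, "for row in df.iterrows()":
    # each item is then an (index, values) tuple, and row[1] is the values.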
    print('Row_index:', row_index)
    print(row_vals['Name'], 'lives in', row_vals['City'], 'and is', row_vals['Age'], 'years old.')
    
    if row_vals['Name'].startswith('A'):
        df.loc[row_index, 'Name'] = df.loc[row_index, 'Name'] + ' was here'
        
    print('We can use loc to get the name from the same row:', df.loc[row_index, 'Name'])
    print()

Row_index: 0
Alice lives in New York and is 25 years old.
We can use loc to get the name from the same row: Alice was here

Row_index: 1
Bob lives in Los Angeles and is 30 years old.
We can use loc to get the name from the same row: Bob

Row_index: 2
Charlie lives in Saskatoon and is 35 years old.
We can use loc to get the name from the same row: Charlie

Row_index: 3
David lives in Houston and is 40 years old.
We can use loc to get the name from the same row: David

Row_index: 4
Eve lives in Phoenix and is 45 years old.
We can use loc to get the name from the same row: Eve

Row_index: 5
Frank lives in Philadelphia and is 50 years old.
We can use loc to get the name from the same row: Frank

Row_index: 6
Grace lives in San Antonio and is 55 years old.
We can use loc to get the name from the same row: Grace

Row_index: 7
Hannah lives in San Diego and is 60 years old.
We can use loc to get the name from the same row: Hannah

Row_index: 8
Isaac lives in Dallas and is 65 years old.
We can 

#### *Exercise*
Use .reset_index() on the df and then iterrows again to see what is changed.  

## Selecting rows with a mask
A mask is a way to say "give me the rows where this condition is true."  In pandas, you create the mask by writing a conditional statement that results in a Series of True/False values, one for each row in the dataframe.  Applying the mask gives you only the rows with a corresponding True value.

We'll look at conditional statements and then a mask example.

### Conditional statements
Here are a few examples of conditional statements:
* Their age is greater than 30:
  * `df['Age'] > 30`
* Their name contains the letter 'a' and they are older than 40:
  * `df['Name'].str.lower().str.contains('a') & (df['Age'] > 40)`
* They are older than 50, or 30 or younger:
  * `(df['Age'] > 50) | (df['Age'] <= 30)`

Note that rather than the "and" and "or" used in regular Python code, we use "&" and "|" when combining pandas series.  These are Python bitwise operators, which pandas applies element by element. 

* Bitwise And: `a & b`
* Bitwise Exclusive Or: `a ^ b`
* Bitwise Inversion (not): `~ a`
* Bitwise Or: `a | b`
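
A minimal sketch of the four operators on two made-up True/False Series, so the row-by-row behavior is visible:

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

and_result = a & b  # True where both are True
or_result = a | b   # True where either is True
xor_result = a ^ b  # True where exactly one is True
not_result = ~a     # flips each value

print(and_result.tolist())
print(or_result.tolist())
print(xor_result.tolist())
print(not_result.tolist())
```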

And when using & and |, we need to put parentheses around the other expressions to make sure they are evaluated before the bitwise operators, because & and | have higher precedence than comparisons like > and <=.  
* This will error:
  * `df['Age'] > 50 | df['Age'] <= 30`
* This is correct:
  * `(df['Age'] > 50) | (df['Age'] <= 30)`
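
A minimal sketch of both forms on a made-up Age column. Without parentheses, Python evaluates `50 | df['Age']` first and then treats the rest as a chained comparison, which ends in a ValueError about the truth value of a Series being ambiguous.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 55]})

# Without parentheses: | is evaluated before > and <=, and this fails.
try:
    mask = df['Age'] > 50 | df['Age'] <= 30
except ValueError as err:
    print('Without parentheses:', err)

# With parentheses: each comparison is evaluated first, then combined.
mask = (df['Age'] > 50) | (df['Age'] <= 30)
print(mask.tolist())
```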

https://introcs.cs.princeton.edu/python/appendix_precedence/#:~:text=Order%20of%20Evaluation,the%20and%20or%20or%20operators.
https://docs.python.org/3/library/operator.html#mapping-operators-to-functions

### Example use of a mask to select some rows:
Let's select all people/rows from our dataframe where their age is > 45:

In [None]:
mask_over_45 = df['Age'] > 45
# mask_under_eq_45 = ~mask_over_45  # example of inverting/negating a mask
# mask_under_eq_45 = df['Age'] <= 45  # this is equivalent to the line above
df_over_45 = df[mask_over_45]
# df_over_45 = df[df['Age'] > 45]  # this is equivalent to above.
print(mask_over_45)
print(df_over_45)

#### *Exercise*
Use conditional statements to make a mask and check .value_counts() on it to see how many people:
* Are older than 60
* Have social security numbers starting with '4'
* Live in Philadelphia or are named Hannah
* Do not live in Dallas

## Using .query to select rows
.query lets us use a SQL-like syntax to select rows.  This is nice because it can be more readable than a conditional statement for a mask.  However, it might be better to use a mask for cases like:
* Your column names have special characters
* You are generating your query/condition programmatically
* You are using operations like .str.contains or other functions in your query.

Documentation and a few good examples: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
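
A minimal sketch of two .query features that come up often, assuming a made-up dataframe: `@name` refers to a Python variable, and backticks quote a column name that contains a space (the `Home City` column is hypothetical). The last line builds the same selection with a mask for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Home City': ['New York', 'Los Angeles', 'Chicago'],
})

min_age = 28
older = df.query('Age > @min_age')              # @ pulls in a Python variable
chicago = df.query('`Home City` == "Chicago"')  # backticks quote the column name
same_as_mask = df[df['Age'] > min_age]          # the mask version of "older"

print(older)
print(chicago)
```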

In [16]:
filtered_df = df.query('Age > 30 and City == "Chicago"')
print(filtered_df)

                Name  Age     City
SSN                               
345-67-8901  Charlie   35  Chicago


## The Dataframe Index


In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/a8ksh4/python_workshop/main/SAMPLE_DATA/titaninc.csv")
# Note that by default, an arbitrary numerical index is assigned to the rows.
# That default index would match exactly with the numeric address of each row, 
# so it is not useful for this example. 
# We instead set the passenger ID as the index - loc refers to this, and iloc 
# refers to the literal numerical address of each row. 
df = df.set_index('PassengerId')
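
A minimal sketch of how set_index and reset_index change what loc refers to, using a tiny made-up passenger table rather than the real file (the IDs and names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    'PassengerId': [101, 102, 103],
    'Name': ['A', 'B', 'C'],
})
print(df.index.tolist())  # the default numerical index

df = df.set_index('PassengerId')
by_label = df.loc[102, 'Name']    # loc looks up the PassengerId label
by_position = df.iloc[1]['Name']  # iloc looks up the second row
print(by_label, by_position)

df = df.reset_index()  # PassengerId becomes a regular column again
print(df.index.tolist())
```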

#### *Exercise*
Use iloc to show these views of the titanic passengers:
* The 4th through 6th passengers
* Even numbered passenger rows (not even PassengerId) and columns 1:4.

## Using .apply for arbitrary operations
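A minimal sketch of .apply, assuming a small made-up dataframe and a hypothetical helper function `age_group`. On a Series, .apply calls the function once per value; on a DataFrame with `axis=1`, it calls the function once per row.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 70]})

def age_group(age):
    """Hypothetical helper: label an age as 'senior' or 'adult'."""
    return 'senior' if age >= 65 else 'adult'

# Series.apply: the function receives one value at a time.
df['Group'] = df['Age'].apply(age_group)

# DataFrame.apply with axis=1: the function receives one row at a time.
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['Group']})", axis=1)

print(df)
```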









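## Note Regarding inplace=True
Most dataframe methods return a modified copy, which we assign to a variable:

    changed_dataframe = df.some_modification()

Pandas is phasing out inplace modification.  For many methods it can still be done by passing 'inplace=True', which changes the dataframe directly instead of returning a new one.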
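
A minimal sketch of the two styles using .rename on a made-up dataframe. The column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 1], 'a': [4, 3]})

# Preferred style: the method returns a new dataframe and df is unchanged.
renamed = df.rename(columns={'a': 'alpha'})

# In-place style: df itself is modified and the call returns None.
result = df.rename(columns={'b': 'beta'}, inplace=True)

print(list(renamed.columns))
print(list(df.columns), result)
```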





## Concatenation
When we read in multiple files, we can concatenate them into a single dataframe.  
Example should show adding an identifier row and pulling date from file name.
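
A minimal sketch of the idea, using in-memory dataframes in place of real files so it runs on its own. The file names, the date format inside them, and the `source_file` and `file_date` column names are all assumptions for illustration. With real files, each small dataframe would come from something like pd.read_csv(file_name).

```python
import pandas as pd

# Hypothetical file names mapped to made-up data, standing in for real files.
files = {
    'sales_2024-01-15.csv': pd.DataFrame({'item': ['a', 'b'], 'qty': [1, 2]}),
    'sales_2024-02-15.csv': pd.DataFrame({'item': ['a', 'c'], 'qty': [3, 4]}),
}

frames = []
for file_name, small_df in files.items():
    small_df = small_df.copy()
    small_df['source_file'] = file_name  # identifier column
    # Pull the date text out of a name like 'sales_2024-01-15.csv'.
    date_text = file_name.split('_')[1].replace('.csv', '')
    small_df['file_date'] = pd.to_datetime(date_text)
    frames.append(small_df)

# ignore_index=True gives the combined dataframe a fresh 0..n-1 index.
big_df = pd.concat(frames, ignore_index=True)
print(big_df)
```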

## Join Operations

## Stack and Unstack (sort of like a pivot table)
**Stack** - This function pivots the columns of a DataFrame into its index, effectively "stacking" the data vertically. It converts a DataFrame from a wide format to a long format.

**Unstack** - This is the reverse of stack. It pivots the index of a DataFrame back into columns, converting it from a long format to a wide format.
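
A minimal sketch of the wide-to-long round trip, assuming a small made-up table of scores with names as the index and subjects as the columns:

```python
import pandas as pd

wide = pd.DataFrame(
    {'math': [90, 80], 'art': [70, 60]},
    index=['Alice', 'Bob'],
)

long = wide.stack()    # one value per (name, subject) pair, in a two-level index
back = long.unstack()  # the subject level becomes columns again

print(long)
print(back)
```

In the long form, each score sits on its own row, labeled by both the person and the subject. That layout can be easier to filter or group, while the wide form is easier to read as a table.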

What does this mean and why!!!???

## Plotting

## Exporting files
### Plain Excel
### Multiple Sheets Excel
### Other