## Source: https://pandas.pydata.org/pandas-docs/stable/index.html

### What is Pandas? 
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

### What data structures does Pandas use?
Pandas has two primary data structures: Series (1-D) and DataFrame (2-D).

### What to expect from this tutorial/notebook? 
This tutorial introduces Pandas at a high level and then walks you through:
- Data Acquisition 
- Data Cleaning 
- Data Filtering 
- Data Aggregation 
- Data Analysis (depending on time availability)

### How is this different from the countless other materials that are publicly available? 
It is by no means exhaustive. Consider it a collection of what I learned as I started using Python, including the tips and tricks I found helpful. If you know a better way, you are welcome to share it with me/us.

#### How to create data structures in Pandas?

In [23]:
# Import required libraries 
import pandas as pd
import numpy as np 
import os 
import pickle  
import csv 

In [6]:
# Creating a series 
my_numeric_series = pd.Series([2, 3, 5, 7], name="Primes_Under_10")
print(my_numeric_series)
my_character_series = pd.Series(["DesertPy", "SoCal_Python", "PyLadies_of_LA"], name="Some_Python_Meetups")
print(my_character_series)
my_mixed_series = pd.Series([2, "a", 4, "b"], name="Mixed_Series")
print(my_mixed_series)

0    2
1    3
2    5
3    7
Name: Primes_Under_10, dtype: int64
0          DesertPy
1      SoCal_Python
2    PyLadies_of_LA
Name: Some_Python_Meetups, dtype: object
0    2
1    a
2    4
3    b
Name: Mixed_Series, dtype: object


In [7]:
# Creating a data frame 
# Method 1 - from list of lists 
print("******************************")
print("Printing Dataframe from method 1")
print("******************************")

list_of_lists = [["Doug Ducey", "Arizona", 2023], ["Gavin Newsom", "California", 2023], 
                 ["Ron Desantis", "Florida", 2023], ["Andrew Cuomo", "New York", 2022],
                 ["Brian Kemp", "Georgia", 2023]]
governors_in_the_news_df = pd.DataFrame(data=list_of_lists, columns=["Name", "State", "Term_Expiry"])
print(governors_in_the_news_df)

print("******************************")
print("Printing Dataframe from method 2")
print("******************************")

# Method 2 - from dictionary of lists
dict_of_lists = {"Name": ["Jay Inslee", "Ned Lamont", "Andy Beshear", "Roy Cooper"],
                "State": ["Washington", "Connecticut", "Kentucky", "North Carolina"],
                "Term_Expiry": [2021, 2023, 2023, 2021]}
governors_df = pd.DataFrame(data=dict_of_lists)
print(governors_df)

print("******************************")
print("Printing Dataframe from method 3")
print("******************************")

# Method 3 - from list of dictionaries 
list_of_dicts = [{'USA': 50, 'Brazil': 26, 'Canada':10}]
states_in_countries_df = pd.DataFrame(data=list_of_dicts, index=["State_Count"])
print(states_in_countries_df)

print("******************************")
print("Printing Dataframe from method 4")
print("******************************")

# Method 4 - from lists with zip 
stock_symbols = ["AAPL", "AMZN", "V", "MA"]
prices_i_wish_i_bought_them_at = [50, 10, 1, 78]
stocks_i_wanted_df = pd.DataFrame(data=list(zip(stock_symbols, prices_i_wish_i_bought_them_at)),
                                  columns=["Stock_Symobl", "Dream_Price"]) 
print(stocks_i_wanted_df)

print("******************************")
print("Printing Dataframe from method 5")
print("******************************")

# Method 5 - dict of pd.Series 
dict_of_series = {
    'Place_I_Wanted_To_Be': pd.Series(["New Zealand", "Fiji", "Bahamas"],
                                      index=["January", "February", "March"]),
    'Place_I_Am_At': pd.Series(["Home", "Home", "Home"],
                               index=["January", "February", "March"])}
lockdown_mood_df = pd.DataFrame(dict_of_series)
print(lockdown_mood_df)

******************************
Printing Dataframe from method 1
******************************
           Name       State  Term_Expiry
0    Doug Ducey     Arizona         2023
1  Gavin Newsom  California         2023
2  Ron Desantis     Florida         2023
3  Andrew Cuomo    New York         2022
4    Brian Kemp     Georgia         2023
******************************
Printing Dataframe from method 2
******************************
           Name           State  Term_Expiry
0    Jay Inslee      Washington         2021
1    Ned Lamont     Connecticut         2023
2  Andy Beshear        Kentucky         2023
3    Roy Cooper  North Carolina         2021
******************************
Printing Dataframe from method 3
******************************
             USA  Brazil  Canada
State_Count   50      26      10
******************************
Printing Dataframe from method 4
******************************
  Stock_Symobl  Dream_Price
0         AAPL           50
1         AMZN           10

## Takeaways-1
#### From the above examples, it is helpful to identify a few takeaways: 
- Series and DataFrame can represent most of the commonly used data sets. Constructing your data into a Series or a DataFrame allows you to leverage a lot of built-in functionality that Pandas offers 
- Series and DataFrame can hold homogeneous or heterogeneous data: a Series with mixed values falls back to the `object` dtype (see `Mixed_Series` above), and each DataFrame column carries its own dtype
- Series and DataFrame have an index, which defaults to consecutive integers starting at 0 but can be set as desired (imagine time stamps, letters etc.)
- Pandas 1.0.0 deprecated the `pandas.util.testing` module in favor of `pandas.testing`, which exposes only assertion functions. While not advisable, if you are using a version < 1.0.0, `pandas.util.testing` offers close to 30 built-in functions to whip up different data frames that make it easy to test. You can get the list of possible functions like so:

```
import pandas.util.testing as tm 
dataframe_constructor_functions = [i for i in dir(tm) if i.startswith('make')]
print(dataframe_constructor_functions)
```

- While all of this is good to know, a typical use case rarely requires creating data by hand. More often you import data from one or more data sources, which leads us to our first topic: Data Acquisition
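The index and testing takeaways above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the dates and counts are made up, and `pandas.testing` is used because it is the supported home of the assertion helpers in pandas 1.0 and later.

```python
import pandas as pd
import pandas.testing as pdt

# A Series indexed by dates instead of the default 0..n-1 integers
# (the dates and values are made up for illustration)
dates = pd.date_range("2020-03-01", periods=3, freq="D")
daily_counts = pd.Series([10, 12, 15], index=dates, name="Daily_Count")

# Label-based lookup uses the date, not the position
print(daily_counts.loc[pd.Timestamp("2020-03-02")])

# Two frames built in different ways should compare equal;
# assert_frame_equal raises an AssertionError if they do not
from_rows = pd.DataFrame([["AAPL", 50], ["V", 1]], columns=["Symbol", "Price"])
from_cols = pd.DataFrame({"Symbol": ["AAPL", "V"], "Price": [50, 1]})
pdt.assert_frame_equal(from_rows, from_cols)
```

The comparison is silent on success, which makes it convenient inside unit tests.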

## Data Acquisition 

#### One of the most powerful and appealing aspects of Pandas is its ability to easily acquire and ingest data from several different data sources including but not limited to: 
- CSV
- Text 
- JSON 
- HTML 
- Excel
- SQL

  An exhaustive list can be found here - https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
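If you do not have any data files at hand, the same readers also accept file-like objects. The sketch below feeds small in-memory strings to `pd.read_csv` and `pd.read_json`; the text snippets are assumptions made up for illustration, loosely modeled on the data used in this notebook.

```python
import io
import pandas as pd

# A tiny CSV held in memory; an empty field becomes NaN
csv_text = """Province/State,Country/Region,Lat,Long
,Afghanistan,33.0,65.0
,Albania,41.1533,20.1683
"""
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)

# A JSON array of records maps to one row per record
json_text = '[{"Name": "Jay Inslee", "State": "Washington"}, {"Name": "Ned Lamont", "State": "Connecticut"}]'
json_df = pd.read_json(io.StringIO(json_text))
print(json_df)
```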

In [8]:
# CSV 
print("******************************")
print("Printing Dataframe from CSV")
print("******************************")

current_directory = os.getcwd()
confirmed_global_cases_file_path = os.path.join(current_directory, "covid-19_data", "time_series_covid19_confirmed_global.csv")
confirmed_global_cases_df = pd.read_csv(filepath_or_buffer=confirmed_global_cases_file_path)
print(confirmed_global_cases_df.head()) 

# JSON 
print("******************************")
print("Printing Dataframe from JSON")
print("******************************")
json_file_path = os.path.join(current_directory, "sample_json.json")
json_df = pd.read_json(json_file_path)
print(json_df)

# Excel
print("******************************")
print("Printing Dataframe from Excel")
print("******************************")
# Build the full path instead of changing the working directory.
# Reading .xlsx files requires an Excel engine such as openpyxl to be installed.
excel_file_path = os.path.join(current_directory, "covid-19_data", "time_series_covid19_recovered_global.xlsx")
recovered_global_cases_df = pd.read_excel(excel_file_path)
print(recovered_global_cases_df.head())

******************************
Printing Dataframe from CSV
******************************
  Province/State Country/Region      Lat     Long  1/22/20  1/23/20  1/24/20  \
0            NaN    Afghanistan  33.0000  65.0000        0        0        0   
1            NaN        Albania  41.1533  20.1683        0        0        0   
2            NaN        Algeria  28.0339   1.6596        0        0        0   
3            NaN        Andorra  42.5063   1.5218        0        0        0   
4            NaN         Angola -11.2027  17.8739        0        0        0   

   1/25/20  1/26/20  1/27/20  ...  3/24/20  3/25/20  3/26/20  3/27/20  \
0        0        0        0  ...       74       84       94      110   
1        0        0        0  ...      123      146      174      186   
2        0        0        0  ...      264      302      367      409   
3        0        0        0  ...      164      188      224      267   
4        0        0        0  ...        3        3        4    

## Takeaways-2

#### Based on the few limited examples above: 
- Pandas has several robust I/O parsers, which make it easy to consume data from many different sources
- If you are going to work with pandas, it is best to use pandas to acquire the data whenever a parser exists, because the parsers are optimized to handle large data sets
- Consuming SQL data would have been a great example, but since my machine does not have enough power to run a SQL Server installation, I am forced to skip it. If you have access to a database, do try consuming data from it. You will need pyodbc, sqlalchemy, or a similar package to make the connection.
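One way to try the SQL route without installing a database server is SQLite, which ships with Python as the `sqlite3` module. This is a substitute for the SQL Server setup mentioned above, not the same thing: the sketch assumes an in-memory database and a hypothetical `governors` table filled with two rows from the earlier example.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; the table name and schema are hypothetical
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE governors (name TEXT, state TEXT, term_expiry INTEGER)"
)
connection.executemany(
    "INSERT INTO governors VALUES (?, ?, ?)",
    [("Jay Inslee", "Washington", 2021), ("Ned Lamont", "Connecticut", 2023)],
)
connection.commit()

# read_sql_query runs the query and returns the result as a DataFrame
sql_df = pd.read_sql_query(
    "SELECT * FROM governors WHERE term_expiry > 2021", connection
)
connection.close()
print(sql_df)
```

For other databases the pattern is similar, but the connection object would come from a driver or from sqlalchemy instead of `sqlite3`.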

## Data Cleaning & Data Filtering (They are quite intertwined)

#### Messy or unorganized data is very common. Even if the data is organized and makes logical sense, it might have missing values (NaN or NA), which can hinder smooth analysis of your data.

### Tip #1 - If you are going to modify a data frame and want to keep the original data, create a copy

In [17]:
# Copying a data frame with deep=True creates a new data frame, including a copy of the data and the indices.
# Changes made to the new data frame are not reflected in the original data frame, and vice versa.
confirmed_global_cases_copy_df = confirmed_global_cases_df.copy(deep=True)

# Let us say we want to look only at rows where state/province-level data is available
confirmed_global_cases_copy_df = confirmed_global_cases_copy_df[confirmed_global_cases_copy_df["Province/State"].notnull()]
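The independence of a deep copy is easy to check on a toy data frame. The frame below is a made-up stand-in for the COVID data, and the value 2027 is an arbitrary edit chosen for illustration.

```python
import pandas as pd

original_df = pd.DataFrame({"State": ["Arizona", "Florida"],
                            "Term_Expiry": [2023, 2023]})
working_df = original_df.copy(deep=True)

# Modify only the copy
working_df.loc[0, "Term_Expiry"] = 2027

print(original_df.loc[0, "Term_Expiry"])  # 2023, the original is untouched
print(working_df.loc[0, "Term_Expiry"])   # 2027
```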

In [26]:
# After filtering, the index still holds the original row labels, so it has gaps and conveys no meaning by itself
# unless we compare it with the original data frame. We can fix that by resetting the index.
confirmed_global_cases_copy_df = confirmed_global_cases_copy_df.reset_index(drop=True)
# drop=True discards the old index instead of adding it back to the data frame as a new column.
# Also remember to assign the result back to your variable: reset_index returns a new object,
# and unless we capture it, the change is lost.
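To see what the filter and `reset_index` do to the index, here is a small sketch on a made-up four-row frame. The column names follow the COVID file, but the rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Province/State": [None, "Alberta", None, "Ontario"],
                   "Country/Region": ["Albania", "Canada", "Algeria", "Canada"]})

# Filtering keeps the original labels, leaving gaps in the index
filtered = df[df["Province/State"].notnull()]
print(list(filtered.index))    # [1, 3]

# drop=True renumbers from 0 and throws the old labels away
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]

# drop=False keeps the old labels in a new column named "index"
kept = filtered.reset_index(drop=False)
print(list(kept.columns))      # ['index', 'Province/State', 'Country/Region']
```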

In [35]:
# Let us say we only want the rows located in the Northern Hemisphere
def hemisphere_flag(row):
    # With axis=1, apply passes one row at a time, so `row` is a Series, not a data frame.
    # Latitude 0 (the equator) is counted as northern here.
    if row["Lat"] >= 0:
        return 1
    else:
        return 0
confirmed_global_cases_copy_df["Northern_Hemisphere_Flag"] = confirmed_global_cases_copy_df.apply(hemisphere_flag, axis=1)
northern_hemisphere_confirmed_cases_df = confirmed_global_cases_copy_df[confirmed_global_cases_copy_df["Northern_Hemisphere_Flag"] == 1]
northern_hemisphere_confirmed_cases_df = northern_hemisphere_confirmed_cases_df.reset_index(drop=True)
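A vectorized alternative to the row-wise `apply` is to compare the whole `Lat` column at once. This is a different technique from the cell above that should give the same flag; the two-row frame is a small stand-in built from values shown in the CSV output earlier.

```python
import pandas as pd

sample_df = pd.DataFrame({"Country/Region": ["Afghanistan", "Angola"],
                          "Lat": [33.0, -11.2027]})

# The comparison yields a boolean Series; astype(int) turns it into 1/0
sample_df["Northern_Hemisphere_Flag"] = (sample_df["Lat"] >= 0).astype(int)

# The same boolean Series can filter the rows directly
northern_df = sample_df[sample_df["Lat"] >= 0].reset_index(drop=True)
print(northern_df)
```

On large data frames the column-wise comparison is usually much faster than calling a Python function once per row.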