From the documentation.  https://pandas.pydata.org/docs/getting_started/overview.html

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

- Intuitive merging and joining data sets

- Flexible reshaping and pivoting of data sets

- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn

# Series and DataFrames

## Series

**Series** is a one-dimensional labeled array capable of holding any data type. The **axis labels** are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

### Create a Series

In [16]:
# Use the Series constructor: s = pd.Series(data, index=index)
# Press Shift + Tab to see other parameters

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(s.loc["a"])
s

0.4215451705744081


a    0.421545
b    0.575668
c    0.231970
d   -0.086098
e   -1.020312
dtype: float64

In [19]:
# Index is optional

s = pd.Series(randn(5))  # Don't need np.random because randn was imported.
print(s.loc[0])
s

-0.9115110633198596


0   -0.911511
1    0.460580
2   -0.065870
3   -0.158830
4    0.093133
dtype: float64

In [4]:
# Index is optional

s = pd.Series(randn(5))  # Don't need np.random because randn was imported.
s

0    0.539593
1   -0.612803
2    0.523122
3   -1.027583
4    0.599541
dtype: float64

In [20]:
# A list, array or dictionary can be used to create a series.

my_list = [5,3,0]
my_arr = np.array([5,3,0])
my_dictionary = {'a':5,'b':3,'c':0}

In [26]:
# Use a dictionary; its keys become the index

pd.Series(my_dictionary)


a    5
b    3
c    0
dtype: int64

In [7]:
# Use a list w/ an index

pd.Series(my_list, index=['a','b','c'])

a    5
b    3
c    0
dtype: int64

In [27]:
# Use a list w/ a list for the index
# (a flat list gives a plain Index; a nested list such as [['a','b','c']] would build a MultiIndex)
i_names = ['a','b','c']

pd.Series(my_list, i_names)

a    5
b    3
c    0
dtype: int64

In [9]:
# Use an array
my_arr = np.array([5,3,0])
pd.Series(my_arr, index=['a','b','c'])

a    5
b    3
c    0
dtype: int64

In [10]:
# Using strings
my_cities = ['Chicago','Atlanta','Boston']
pd.Series(my_cities, i_names)

a    Chicago
b    Atlanta
c     Boston
dtype: object

In [11]:
# Use the cities as the labels
my_cities = ['Chicago','Atlanta','Boston']
state = ['IL','GA','MA']
cities = pd.Series(state, my_cities)
cities

Chicago    IL
Atlanta    GA
Boston     MA
dtype: object

### Using the Series index

In [12]:
cities['Chicago']

'IL'

## DataFrames

**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 
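
The "dict of Series" view can be made concrete with a small sketch. The Series names and numbers below are made up for illustration; the point is only that each dictionary key becomes a column and that rows are aligned on the index labels.

```python
import pandas as pd

# Two Series whose labels only partly overlap (hypothetical numbers).
apples = pd.Series([3, 2, 0], index=["IL", "GA", "MA"])
oranges = pd.Series([0, 3, 7], index=["GA", "MA", "VT"])

# Each dict key becomes a column; rows cover the union of the labels,
# and a label missing from one Series shows up as NaN in that column.
fruit = pd.DataFrame({"apples": apples, "oranges": oranges})
print(fruit)
```

Because the alignment is done by label, the order of the values inside each Series does not need to match.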

### Create a DataFrame

In [13]:
np.random.seed(1234)  
df = pd.DataFrame(randn(4,5),index=['IL','GA','MA','VT'],columns=['Sent','Used','Expired','Lost','Destroyed'])
df

Unnamed: 0,Sent,Used,Expired,Lost,Destroyed
IL,0.471435,-1.190976,1.432707,-0.312652,-0.720589
GA,0.887163,0.859588,-0.636524,0.015696,-2.242685
MA,1.150036,0.991946,0.953324,-2.021255,-0.334077
VT,0.002118,0.405453,0.289092,1.321158,-1.546906


In [14]:
# A little shortcut
np.random.seed(1234)  
df = pd.DataFrame(randn(4,5),index='IL GA MA VT'.split(),columns='S U E L D'.split())
df

Unnamed: 0,S,U,E,L,D
IL,0.471435,-1.190976,1.432707,-0.312652,-0.720589
GA,0.887163,0.859588,-0.636524,0.015696,-2.242685
MA,1.150036,0.991946,0.953324,-2.021255,-0.334077
VT,0.002118,0.405453,0.289092,1.321158,-1.546906


In [15]:
# Create a DataFrame

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
sales = pd.DataFrame(data)
sales

Unnamed: 0,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


### Using the DataFrame index

In [16]:
df

Unnamed: 0,S,U,E,L,D
IL,0.471435,-1.190976,1.432707,-0.312652,-0.720589
GA,0.887163,0.859588,-0.636524,0.015696,-2.242685
MA,1.150036,0.991946,0.953324,-2.021255,-0.334077
VT,0.002118,0.405453,0.289092,1.321158,-1.546906


In [17]:
# Select a column
df['S']

IL    0.471435
GA    0.887163
MA    1.150036
VT    0.002118
Name: S, dtype: float64

In [18]:
# Select multiple columns
df[['S','E']]            # Outer brackets: [expecting an argument]; inner brackets: passing in a list ['S','E']

Unnamed: 0,S,E
IL,0.471435,1.432707
GA,0.887163,-0.636524
MA,1.150036,0.953324
VT,0.002118,0.289092


In [19]:
# Getting a row
df.loc['IL']

S    0.471435
U   -1.190976
E    1.432707
L   -0.312652
D   -0.720589
Name: IL, dtype: float64

In [20]:
df.iloc[0]

S    0.471435
U   -1.190976
E    1.432707
L   -0.312652
D   -0.720589
Name: IL, dtype: float64

In [21]:
df.iloc[1:3]

Unnamed: 0,S,U,E,L,D
GA,0.887163,0.859588,-0.636524,0.015696,-2.242685
MA,1.150036,0.991946,0.953324,-2.021255,-0.334077
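
The cells above select whole rows. As a minimal sketch of selecting a row and a column together, the block below uses a stand-in DataFrame filled with consecutive integers (an assumption, so the values are easy to check by eye) and the same labels as `df`.

```python
import numpy as np
import pandas as pd

# Stand-in for df: same labels, but integers 0..19 instead of random values.
demo = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    index=["IL", "GA", "MA", "VT"],
    columns=["S", "U", "E", "L", "D"],
)

by_label = demo.loc["GA", "U"]     # row label, column label
by_position = demo.iloc[1, 1]      # row position, column position

# Label slices include the end label; position slices exclude the end position.
label_slice = demo.loc["GA":"MA", ["S", "E"]]
position_slice = demo.iloc[1:3, [0, 2]]

print(by_label, by_position)
print(label_slice)
```

Both slices return the GA and MA rows, which is why `"GA":"MA"` and `1:3` are equivalent here.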


# Ingest Data

## Ingest data

https://pandas.pydata.org/docs/user_guide/io.html


In [22]:
# https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/transactions.csv
# https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers.csv

# https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers_no_header.csv

### Reading csv files

In [23]:
import numpy as np
import pandas as pd
from numpy.random import randn


In [24]:
transactions = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/transactions.csv')
customers = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers.csv')

In [25]:
transactions.head()

Unnamed: 0,transaction_id,customer_id,order_date,channel,store_id,payment_method,card_bank,subtotal,discount_amount,tax_amount,shipping_amount,total_amount,promo_code_used,device_type,fulfillment_method,return_flag
0,4001,1001,1/5/24 10:30,Online,,Credit Card,Chase,189.97,20.0,15.2,0.0,185.17,SPRING24,Mobile,Ship,False
1,4002,1001,1/12/24 14:45,Store,3002.0,Credit Cards,Citi,299.96,0.0,24.0,0.0,323.96,,,In-Store,False
2,4003,1003,1/8/24 11:20,Online,,PayPal,,149.99,15.0,11.2,5.99,151.18,WELCOME10,Desktop,Ship,False
3,4004,1003,1/15/24 16:00,Mobile App,,Apple Pay,,69.99,0.0,5.6,0.0,75.59,,Mobile,BOPIS,False
4,4005,1002,1/7/24 13:30,Store,3001.0,credit card,,79.99,0.0,6.4,0.0,86.39,,,In-Store,False


### read_csv options

In [26]:
df = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers_no_header.csv', 
                 header = None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,F,Occasional,3/15/22,Social Media,892.5,Low
1,1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,1/20/21,Google,4587.3,Low
2,1003,Sarah,Williams,sarah.w@email.com,Houston,TX,42,F,Frequent,6/10/21,Email,2245.8,Medium
3,1004,James,Davis,james.d@email.com,San Antonio,TX,31,M,Occasional,2/28/23,Direct,445.6,High
4,1005,Maria,Garcia,maria.g@email.com,Austin,TX,26,F,New,1/5/24,Instagram,125.4,Medium


In [27]:
# Supply column names for a file that has no header row.
# The names match the header of customers.csv.
df = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers_no_header.csv', 
                 header = None, 
                names = ('customer_id','first_name','last_name','email','city','state','age','gender','customer_segment',
                        'acquisition_date','acquisition_channel','lifetime_value','churn_risk'))
df.head()

Unnamed: 0,customer_id,first_name,last_name,email,city,state,age,gender,customer_segment,acquisition_date,acquisition_channel,lifetime_value,churn_risk
0,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,F,Occasional,3/15/22,Social Media,892.5,Low
1,1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,1/20/21,Google,4587.3,Low
2,1003,Sarah,Williams,sarah.w@email.com,Houston,TX,42,F,Frequent,6/10/21,Email,2245.8,Medium
3,1004,James,Davis,james.d@email.com,San Antonio,TX,31,M,Occasional,2/28/23,Direct,445.6,High
4,1005,Maria,Garcia,maria.g@email.com,Austin,TX,26,F,New,1/5/24,Instagram,125.4,Medium


In [28]:
# Any names can be assigned. The file has 13 columns, so the extra names (n, o, p) become empty (NaN) columns.

df = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers_no_header.csv', 
                 header = None, 
                names = ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p'))
df.head()

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
0,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,F,Occasional,3/15/22,Social Media,892.5,Low,,,
1,1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,1/20/21,Google,4587.3,Low,,,
2,1003,Sarah,Williams,sarah.w@email.com,Houston,TX,42,F,Frequent,6/10/21,Email,2245.8,Medium,,,
3,1004,James,Davis,james.d@email.com,San Antonio,TX,31,M,Occasional,2/28/23,Direct,445.6,High,,,
4,1005,Maria,Garcia,maria.g@email.com,Austin,TX,26,F,New,1/5/24,Instagram,125.4,Medium,,,


In [29]:
# Use a column as the index

df = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers_no_header.csv', 
                 header = None, 
                names = ('customer_id','first_name','last_name','email','city','state','age','gender','customer_segment',
                        'acquisition_date','acquisition_channel','lifetime_value','churn_risk'),index_col = 'customer_id')
df.head()

customer_id,first_name,last_name,email,city,state,age,gender,customer_segment,acquisition_date,acquisition_channel,lifetime_value,churn_risk
1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,F,Occasional,3/15/22,Social Media,892.5,Low
1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,1/20/21,Google,4587.3,Low
1003,Sarah,Williams,sarah.w@email.com,Houston,TX,42,F,Frequent,6/10/21,Email,2245.8,Medium
1004,James,Davis,james.d@email.com,San Antonio,TX,31,M,Occasional,2/28/23,Direct,445.6,High
1005,Maria,Garcia,maria.g@email.com,Austin,TX,26,F,New,1/5/24,Instagram,125.4,Medium


In [30]:
# Look up a row by its index label (the customer_id)
df.loc[1004]

first_name                         James
last_name                          Davis
email                  james.d@email.com
city                         San Antonio
state                                 TX
age                                   31
gender                                 M
customer_segment              Occasional
acquisition_date                 2/28/23
acquisition_channel               Direct
lifetime_value                     445.6
churn_risk                          High
Name: 1004, dtype: object

In [31]:
# Only use a subset of the columns
# usecols picks columns by position (0, 2, 4); names labels the columns that are kept.

df = pd.read_csv('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers_no_header.csv', 
                 header = None, 
                names = ('customer_id','last_name','city'),usecols = [0,2,4])
df.head()

Unnamed: 0,customer_id,last_name,city
0,1001,Johnson,Austin
1,1002,Chen,Dallas
2,1003,Williams,Houston
3,1004,Davis,San Antonio
4,1005,Garcia,Austin
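
The `read_csv` options used in this section can be tried without a network connection. The sketch below reads from an in-memory string instead of the GitHub URL; the three rows are copied from the customer output above but trimmed to five fields, so the column positions are an assumption of this example and not those of the real file.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for a headerless CSV file (trimmed, illustrative rows).
raw = StringIO(
    "1001,Emily,Johnson,Austin,TX\n"
    "1002,Michael,Chen,Dallas,TX\n"
    "1003,Sarah,Williams,Houston,TX\n"
)

small = pd.read_csv(
    raw,
    header=None,                                  # the data has no header row
    usecols=[0, 2, 3],                            # keep columns by position
    names=["customer_id", "last_name", "city"],   # labels for the kept columns
    index_col="customer_id",                      # use one of them as the index
)
print(small)
print(small.loc[1002, "city"])
```

Combining `usecols` with `names` keeps the labels honest: only the columns that are actually read get a name.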


### Reading other file types

 - excel
 - json
 - APIs
 - database

In [32]:
# Read in an excel spreadsheet
# I had to install openpyxl in my anaconda environment for this to work.

c_excel = pd.read_excel('https://raw.githubusercontent.com/jimcody2014/eic/refs/heads/main/customers.xlsx')
c_excel.head()

Unnamed: 0,customer_id,first_name,last_name,email,city,state,age,gender,customer_segment,acquisition_date,acquisition_channel,lifetime_value,churn_risk
0,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,F,Occasional,2022-03-15,Social Media,892.5,Low
1,1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,2021-01-20,Google,4587.3,Low
2,1003,Sarah,Williams,sarah.w@email.com,Houston,TX,42,F,Frequent,2021-06-10,Email,2245.8,Medium
3,1004,James,Davis,james.d@email.com,San Antonio,TX,31,M,Occasional,2023-02-28,Direct,445.6,High
4,1005,Maria,Garcia,maria.g@email.com,Austin,TX,26,F,New,2024-01-05,Instagram,125.4,Medium


In [33]:
# Saving a dataframe to json format
c_excel.to_json('cust.json')

In [41]:
# This identifies the current working directory.  That is where the json is written to.
import os
print(os.getcwd())

/Users/jcody/GitHub/eic-notebooks


In [43]:
print((os.getcwd()+'/cust.json'))

/Users/jcody/GitHub/eic-notebooks/cust.json


In [42]:
# Read in a json file

c2 = pd.read_json(os.getcwd()+'/cust.json')
c2.head()

Unnamed: 0,customer_id,first_name,last_name,email,city,state,age,gender,customer_segment,acquisition_date,acquisition_channel,lifetime_value,churn_risk
0,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,F,Occasional,1647302400000,Social Media,892.5,Low
1,1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,1611100800000,Google,4587.3,Low
2,1003,Sarah,Williams,sarah.w@email.com,Houston,TX,42,F,Frequent,1623283200000,Email,2245.8,Medium
3,1004,James,Davis,james.d@email.com,San Antonio,TX,31,M,Occasional,1677542400000,Direct,445.6,High
4,1005,Maria,Garcia,maria.g@email.com,Austin,TX,26,F,New,1704412800000,Instagram,125.4,Medium
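
Note that `acquisition_date` came back from the JSON file as large integers: the dates were written as milliseconds since 1970-01-01 and were not converted back on read. One possible way to restore them is sketched below with a small made-up frame rather than `cust.json`; the `is_integer_dtype` guard is an assumption added so the sketch also works if a pandas version converts the column automatically.

```python
from io import StringIO

import pandas as pd

original = pd.DataFrame(
    {
        "customer_id": [1001, 1002],
        "acquisition_date": pd.to_datetime(["2022-03-15", "2021-01-20"]),
    }
)

# Ask for epoch milliseconds explicitly, matching the values shown above.
json_text = original.to_json(date_format="epoch", date_unit="ms")

restored = pd.read_json(StringIO(json_text))

# If the column is still integer, interpret it as milliseconds since the epoch.
if pd.api.types.is_integer_dtype(restored["acquisition_date"]):
    restored["acquisition_date"] = pd.to_datetime(
        restored["acquisition_date"], unit="ms"
    )
print(restored)
```

Writing with `date_format="iso"` is another option worth trying, since ISO strings are easier to read in the file itself.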


### Using an API as input

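Example API call for Chicago 
You can use Chicago's latitude 41.85 and longitude -87.65 to request an hourly forecast from Open-Meteo. The request below does not set `forecast_days`, so it returns the API's default range (7 days at the time of writing); up to 16 days can be requested.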

https://api.open-meteo.com/v1/forecast?latitude=41.85&longitude=-87.65&hourly=temperature_2m,weather_code,wind_speed_10m 

In [None]:
import requests

In [None]:
# Send the request and receive a response
# Get the content of the response (there are other parts of the response (e.g., header))
# Display the content in json format

response = requests.get("https://api.open-meteo.com/v1/forecast?latitude=41.85&longitude=-87.65&hourly=temperature_2m,weather_code,wind_speed_10m")
jsonhold = response.json()
jsonhold




In [None]:
# Put the content into a DataFrame
# Display the DataFrame

chicago = pd.DataFrame(jsonhold)
chicago

In [None]:
# Same process
# Just set the url as a variable
# Combined a few other steps

url = 'https://api.open-meteo.com/v1/forecast?latitude=41.85&longitude=-87.65&hourly=temperature_2m,weather_code,wind_speed_10m'
response = requests.get(url)
chicago = pd.DataFrame(response.json())
chicago
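
Passing the whole response to `pd.DataFrame` mixes the top-level fields with the nested hourly values. A tidier table can usually be built from the `hourly` key alone. The payload below is a hand-written, hypothetical stand-in that mimics the layout of the response (a few scalar fields plus an `hourly` dictionary of equal-length lists), so the sketch runs without calling the API.

```python
import pandas as pd

# Hypothetical payload shaped like response.json(); the values are made up.
payload = {
    "latitude": 41.85,
    "longitude": -87.65,
    "hourly_units": {"time": "iso8601", "temperature_2m": "°C"},
    "hourly": {
        "time": ["2024-01-05T00:00", "2024-01-05T01:00", "2024-01-05T02:00"],
        "temperature_2m": [-1.2, -1.5, -1.9],
        "wind_speed_10m": [12.3, 11.8, 10.4],
    },
}

# One row per hour, one column per requested variable.
hourly = pd.DataFrame(payload["hourly"])
hourly["time"] = pd.to_datetime(hourly["time"])
hourly = hourly.set_index("time")
print(hourly)
```

With the real response, the same idea would be `pd.DataFrame(response.json()["hourly"])`, assuming the request succeeded and the key is present.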

## Exercise - 15 minutes

### Part 1

 - Bring the csv file customer_touchpoints.csv into a DataFrame.  Name the dataframe df.
 - What is its shape?
 
### Part 2

 - Read marketing_campaigns_no-header.csv file into a df.  Name the df something you will remember.
 - The csv file does not have a header row. 
 
 - Do not bring in the 6th or 8th columns (indexing from 0).  The df should have 10 columns.
 - The remaining columns contain the following data.  You can decide how to name the columns:
    - campaign_id
    - campaign_name
    - campaign_type
    - channel
    - start_date
    - end_date
    - budget (6th)
    - target_audience
    - campaign_goal
    - status (8th)
 - What is the shape of this DataFrame?
   
   

In [None]:
# Part 1




In [None]:
# Part 2

## Merge, Join & Concat

This is the pandas documentation for merge, join and concat.
https://pandas.pydata.org/docs/user_guide/merging.html

### Short notes

1. Merge and join bring dataframes together to create more columns
2. Concat stacks dataframes.  Think unions and intersections
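
A minimal sketch of the second note, using two made-up frames whose columns only partly overlap. The default `join="outer"` keeps the union of the columns, while `join="inner"` keeps only the columns both frames share.

```python
import pandas as pd

# Hypothetical frames: south has an extra 'pears' column.
north = pd.DataFrame({"location": ["bolton", "berlin"], "apples": [3, 2]})
south = pd.DataFrame({"location": ["charlton"], "apples": [1], "pears": [2]})

# Stack rows; the union of columns is kept and gaps are filled with NaN.
stacked = pd.concat([north, south], ignore_index=True)

# Stack rows; only the columns present in both frames are kept.
shared = pd.concat([north, south], ignore_index=True, join="inner")

print(stacked)
print(shared)
```

`ignore_index=True` renumbers the rows 0, 1, 2 instead of repeating each frame's own index.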

**Are there any practical differences between using merge or join in pandas?**

Yes, there are several practical differences between merge() and join() in pandas:

Key Differences
1. Default Join Behavior:

- merge(): Defaults to inner join on common column names
- join(): Defaults to left join on index

2. What They Join On:

- merge(): Primarily joins on columns (can join on index with parameters)
- join(): Primarily joins on index (can join on columns with parameters)

**Practical Considerations**

Use merge() when:

- Joining on columns (most common case)
- You want explicit control over join conditions
- Working with DataFrames that don't have meaningful indexes
- You need maximum flexibility

Use join() when:

- Joining on index (especially when index is meaningful)
- You want concise syntax for simple joins
- Joining multiple DataFrames at once
- Working with time series data (common to have datetime index)
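
The two defaults described above can be seen side by side in a short sketch. The frames and the `key` column are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge(): inner join on the shared column 'key', so only b and c survive.
merged = pd.merge(left, right)

# join(): left join on the index, so a, b and c survive and a has no y value.
joined = left.set_index("key").join(right.set_index("key"))

print(merged)
print(joined)
```

Passing `how=` to either function overrides its default, so the two can be made to return the same rows.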

In [47]:
# Simple Merge Example

listing = pd.merge(customers,transactions, on='customer_id')
listing.head()

Unnamed: 0,customer_id,first_name,last_name,email,city,state,age,gender,customer_segment,acquisition_date,...,card_bank,subtotal,discount_amount,tax_amount,shipping_amount,total_amount,promo_code_used,device_type,fulfillment_method,return_flag
0,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,0,Occasional,3/15/22,...,Chase,189.97,20.0,15.2,0.0,185.17,SPRING24,Mobile,Ship,False
1,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,0,Occasional,3/15/22,...,Citi,299.96,0.0,24.0,0.0,323.96,,,In-Store,False
2,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,0,Occasional,3/15/22,...,,249.95,25.0,18.0,0.0,242.95,VIP20,Desktop,Ship,False
3,1001,Emily,Johnson,emily.j@email.com,Austin,TX,28,0,Occasional,3/15/22,...,,149.97,0.0,12.0,0.0,161.97,,,In-Store,False
4,1002,Michael,Chen,m.chen@email.com,Dallas,TX,35,M,High-Value,1/20/21,...,,79.99,0.0,6.4,0.0,86.39,,,In-Store,False


In [48]:
df1 = {
    'location':['bolton','berlin','boyleston','charlton'],
    'apples': [3, 2, 0, 1], 
    'pears': [0, 3, 7, 2]
}

df2 = {
    'location':['bolton','berlin','boyleston','charlton'],
    'blueberries': [3, 2, 0, 1], 
    'strawberries': [0, 3, 7, 2]
}

d1 = pd.DataFrame(df1)
d2 = pd.DataFrame(df2)
d1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   location  4 non-null      object
 1   apples    4 non-null      int64 
 2   pears     4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


In [49]:
pd.merge(d1,d2,how = 'inner')  # how can be 'inner', 'outer', 'left', or 'right'
# When `on` is not given, merge uses the columns the two DataFrames have in common (here, 'location').

Unnamed: 0,location,apples,pears,blueberries,strawberries
0,bolton,3,0,3,0
1,berlin,2,3,2,3
2,boyleston,0,7,0,7
3,charlton,1,2,1,2
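
Because `d1` and `d2` list the same four locations, every value of `how` gives the same result here. The sketch below uses two hypothetical frames that share only two locations, so the four join types return different numbers of rows.

```python
import pandas as pd

# Hypothetical frames: bolton appears only on the left, charlton only on the right.
left_df = pd.DataFrame(
    {"location": ["bolton", "berlin", "boyleston"], "apples": [3, 2, 0]}
)
right_df = pd.DataFrame(
    {"location": ["berlin", "boyleston", "charlton"], "blueberries": [2, 0, 1]}
)

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(left_df, right_df, on="location", how=how)
    row_counts[how] = len(result)
print(row_counts)

# indicator=True adds a _merge column recording where each row came from.
outer = pd.merge(left_df, right_df, on="location", how="outer", indicator=True)
print(outer[["location", "_merge"]])
```

The `_merge` column is a quick way to spot keys that are missing from one side before deciding which join type to use.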


In [50]:
df3 = {
    'state':['MA','MA','VT','VT'],
    'location':['bolton','berlin','boyleston','berlin'],
    'apples': [3, 2, 0, 1], 
    'pears': [0, 3, 7, 2]
}

df4 = {
    'state':['MA','MA','VT','VT'],
    'location':['bolton','berlin','boyleston','berlin'],
    'blueberries': [3, 2, 0, 1], 
    'strawberries': [0, 3, 7, 2]
}

d3 = pd.DataFrame(df3)
d4 = pd.DataFrame(df4)

In [51]:
multi = pd.merge(d3,d4, how = 'inner', on = ['state','location'])
multi

Unnamed: 0,state,location,apples,pears,blueberries,strawberries
0,MA,bolton,3,0,3,0
1,MA,berlin,2,3,2,3
2,VT,boyleston,0,7,0,7
3,VT,berlin,1,2,1,2


### Join example

In [52]:
# Basic syntax: first_dataframe.join(second_dataframe). Rows are matched on the index.
# lsuffix renames the left copy of the overlapping 'location' column.
d1.join(d2, lsuffix = '_1')

# d1.join(d2, lsuffix = '_1', rsuffix = '_2')


Unnamed: 0,location_1,apples,pears,location,blueberries,strawberries
0,bolton,3,0,bolton,3,0
1,berlin,2,3,berlin,2,3
2,boyleston,0,7,boyleston,0,7
3,charlton,1,2,charlton,1,2
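
The output above pairs rows by their position in the default index and keeps both `location` columns. As one possible alternative, `join` accepts an `on` argument that matches a column of the calling DataFrame against the index of the other one. The frames below are hypothetical and deliberately list the locations in a different order to show that rows are matched by value.

```python
import pandas as pd

fruit = pd.DataFrame(
    {"location": ["bolton", "berlin", "charlton"], "apples": [3, 2, 1]}
)
berries = pd.DataFrame(
    {"location": ["charlton", "bolton", "berlin"], "blueberries": [1, 3, 2]}
)

# Match fruit's 'location' column against berries' index.
combined = fruit.join(berries.set_index("location"), on="location")
print(combined)
```

Since `location` is no longer a column of the right-hand frame, no suffix is needed.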
