<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Pandas: Basic Manipulation of Series/DataFrames
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 1: Topic 4</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    

#### pandas Series and DataFrames: changing attributes and values
- Cleaning and altering column/index names
- Creating and removing columns/rows
- Altering values 
- Changing datatypes


Import our libraries:

In [1]:
import numpy as np
import pandas as pd

Load in our trusty cereal dataset again:

In [2]:
cereal_df = pd.read_csv('Data/cereal.csv')
cereal_df.columns 

Index(['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber',
       'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups',
       'rating'],
      dtype='object')

We want to rename some of these columns.

#### Renaming columns 

- DataFrame.rename(columns = ___)
- columns takes in a dict that maps column names.

In [4]:
cereal_df.rename(columns = {'mfr': 'manufacturer', 'carbo': 'carbohydate', 'potass': 'potassium  '})

,name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.00,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.50,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Triples,G,C,110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
73,Trix,G,C,110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.00,27.753301
74,Wheat Chex,R,C,100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
75,Wheaties,G,C,100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.00,51.592193


In [5]:
cereal_df.head(2)

,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679


Column names are still the same. What gives?

By default, the DataFrame.rename() method returns a new dataframe and leaves the original unchanged.

In [6]:
cereal_df = cereal_df.rename(columns = {'mfr': 'manufacturer', 'carbo': 'carbohydate', 'potass': 'potassium  '})

Passing inplace = True has the same effect as reassigning:

In [7]:
cereal_df.rename(columns = {'mfr': 'manufacturer', 'carbo': 'carbohydate', 'potass': 'potassium  '}, inplace = True)

In [None]:
cereal_df.head(2)
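
As a minimal sketch of the copy-versus-reassign behavior described above, the toy frame below (made-up values, not the cereal data; `toy_df` and `renamed` are names chosen for illustration) shows that `rename` leaves the original untouched until the result is assigned back.

```python
import pandas as pd

# Toy frame with made-up values (hypothetical, not the cereal dataset)
toy_df = pd.DataFrame({'mfr': ['N', 'Q'], 'carbo': [5.0, 8.0]})

# rename returns a new DataFrame; toy_df itself is unchanged here
renamed = toy_df.rename(columns={'mfr': 'manufacturer'})
columns_before = list(toy_df.columns)
print(columns_before)
print(list(renamed.columns))

# Assigning the result back to the same name keeps the change
toy_df = toy_df.rename(columns={'mfr': 'manufacturer'})
print(list(toy_df.columns))
```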

Let's take a look at the potassium column.

In [8]:
cereal_df['potassium']

KeyError: 'potassium'

What happened?

In [9]:
cereal_df.columns

Index(['name', 'manufacturer', 'type', 'calories', 'protein', 'fat', 'sodium',
       'fiber', 'carbohydate', 'sugars', 'potassium  ', 'vitamins', 'shelf',
       'weight', 'cups', 'rating'],
      dtype='object')

Note the trailing whitespace in 'potassium  '. Column names imported from files often have this problem.

What string command do we need to trim white space?

One way that works, but only builds a plain Python list and leaves the dataframe's columns unchanged:

In [9]:
[col.strip() for col in cereal_df.columns]

['name',
 'manufacturer',
 'type',
 'calories',
 'protein',
 'fat',
 'sodium',
 'fiber',
 'carbohydate',
 'sugars',
 'potassium',
 'vitamins',
 'shelf',
 'weight',
 'cups',
 'rating']

The idiomatic pandas way (the Index's .str accessor applies the string method to every column name, and we assign the result back):

In [11]:
cereal_df.columns = cereal_df.columns.str.strip()
print(cereal_df.columns)

Index(['name', 'manufacturer', 'type', 'calories', 'protein', 'fat', 'sodium',
       'fiber', 'carbohydate', 'sugars', 'potassium', 'vitamins', 'shelf',
       'weight', 'cups', 'rating'],
      dtype='object')
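
The same `.str` accessor can chain several cleanups at once. The sketch below is illustrative only: `messy_df` and its column names are invented, and lowercasing and replacing spaces with underscores are one possible convention, not something the cereal data requires.

```python
import pandas as pd

# Hypothetical frame with untidy column names
messy_df = pd.DataFrame({' Name ': ['a'], 'Total Fat  ': [1], 'SODIUM': [130]})

# Strip outer whitespace, lowercase, then replace inner spaces with underscores
messy_df.columns = (
    messy_df.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)
print(list(messy_df.columns))
```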


Now look at the potassium column:

In [12]:
cereal_df['potassium'].head(3)

0    280
1    135
2    320
Name: potassium, dtype: int64

#### Removing Columns

The `shelf` column records which shelf of the cereal aisle the product sits on in a particular grocery store.

- We don't care about this column.


<figure><center><img src = "Images/snoop_dogg.jpg" width = 400></center>

</figure>



In [13]:
cereal_df.drop(columns = ['shelf'], inplace = True)
cereal_df.columns

Index(['name', 'manufacturer', 'type', 'calories', 'protein', 'fat', 'sodium',
       'fiber', 'carbohydate', 'sugars', 'potassium', 'vitamins', 'weight',
       'cups', 'rating'],
      dtype='object')
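
A small sketch of `drop(columns=...)` on a toy frame (made-up values; the variable names are hypothetical). It also shows the `errors='ignore'` option, which may be handy when a notebook cell is re-run after the column is already gone.

```python
import pandas as pd

# Toy frame with made-up values
toy_df = pd.DataFrame({'name': ['A', 'B'], 'shelf': [3, 1], 'cups': [0.33, 1.0]})

# Without inplace=True, drop returns a new frame and toy_df keeps 'shelf'
trimmed = toy_df.drop(columns=['shelf'])
print(list(trimmed.columns))

# errors='ignore' skips labels that are absent instead of raising a KeyError
trimmed_again = trimmed.drop(columns=['shelf'], errors='ignore')
print(list(trimmed_again.columns))
```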

#### Creating new columns

The 'type' column has only two unique entries 'C' and 'H' (cold or hot cereal?):

- Can use Boolean condition to create a series (Pandas magic).

Is this a hot cereal? Convert Boolean to integer.

- False = 0
- True = 1

In [19]:
is_hot = (cereal_df.type == 'H').astype('int')


print(is_hot)
print(is_hot.value_counts())

0     0
1     0
2     0
3     0
4     0
     ..
72    0
73    0
74    0
75    0
76    0
Name: type, Length: 77, dtype: int32
0    74
1     3
Name: type, dtype: int64


Store this as a new column by:
- DataFrame[new_column_name] = series

In [20]:
cereal_df['is_hot'] = is_hot
cereal_df.head()

,name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,1.0,0.33,68.402973,0
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,1.0,0.75,34.384843,0
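
The Boolean-to-integer step can be sketched on a toy frame as follows (made-up rows; `toy_df` and `mask` are illustrative names). Bracket access `toy_df['type']` is used here instead of attribute access, which is a stylistic choice that avoids clashes with DataFrame attribute names.

```python
import pandas as pd

# Toy frame with made-up rows
toy_df = pd.DataFrame({'name': ['A', 'B', 'C'], 'type': ['C', 'H', 'C']})

# Comparing a column to a value yields a Boolean Series, one entry per row
mask = toy_df['type'] == 'H'

# astype(int) maps False -> 0 and True -> 1; assigning creates the new column
toy_df['is_hot'] = mask.astype(int)
print(toy_df)
```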


#### Dealing with the index and rows:
- Clearly, the 'name' column should be our index.
- .set_index(col_name) will set that column to the row index.
- .set_index() can also take in a list or an index object.

In [21]:
cereal_df.set_index('name', inplace = True)
cereal_df.head()

name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,1.0,0.33,68.402973,0
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,1.0,0.75,34.384843,0


- Sometimes we want to reset the index.
- .reset_index() moves the index back into a regular column.
- The dataframe then gets a default integer index again.

In [22]:
cereal_df.reset_index(inplace = True)
cereal_df.head()

,name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,1.0,0.33,68.402973,0
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,1.0,0.75,34.384843,0
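
A minimal round-trip sketch of `set_index` and `reset_index` on a toy frame (made-up values, hypothetical variable names). It also shows `reset_index(drop=True)`, which discards the old index instead of turning it back into a column.

```python
import pandas as pd

# Toy frame with made-up values
toy_df = pd.DataFrame({'name': ['A', 'B'], 'calories': [70, 120]})

# Move 'name' out of the columns and into the row index
indexed = toy_df.set_index('name')

# Move it back: 'name' is a column again and the index is 0, 1, ...
restored = indexed.reset_index()

# drop=True throws the old index away rather than keeping it as a column
discarded = indexed.reset_index(drop=True)
print(list(restored.columns), list(discarded.columns))
```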


Dropping rows by index name:

In [23]:
cereal_df.set_index('name', inplace = True)
allbran_dropped = cereal_df.drop('All-Bran') # can also take a list of index names or an index object
allbran_dropped.head(4)

name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,1.0,0.33,68.402973,0
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,1.0,0.75,34.384843,0


In [24]:
two_dropped = cereal_df.drop(['100% Bran', 'Almond Delight'])
two_dropped.head()

name,manufacturer,type,calories,protein,fat,sodium,fiber,carbohydate,sugars,potassium,vitamins,weight,cups,rating,is_hot
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,1.0,1.0,33.983679,0
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,1.0,0.33,59.425505,0
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,1.0,0.5,93.704912,0
Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1.0,0.75,29.509541,0
Apple Jacks,K,C,110,2,0,125,1.0,11.0,14,30,25,1.0,1.0,33.174094,0
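
Row dropping by label can be sketched on a toy frame like this (made-up values; `without_b` and `only_b` are illustrative names). The `index=` keyword is an explicit alternative to passing the labels positionally.

```python
import pandas as pd

# Toy frame with made-up values and string labels as the index
toy_df = pd.DataFrame({'calories': [70, 120, 110]}, index=['A', 'B', 'C'])

# Drop a single row by its index label; toy_df itself is unchanged
without_b = toy_df.drop('B')

# Drop several rows at once, naming the axis explicitly with index=
only_b = toy_df.drop(index=['A', 'C'])
print(without_b.index.tolist(), only_b.index.tolist())
```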


#### Altering dataframe/series values

- It's really important to use the .loc[] accessor when assigning data to dataframe/series selections.
- Here's why:

Select all cold cereals and look at their rating:

In [25]:
cereal_df[cereal_df["type"] == 'C']["rating"] 

name
100% Bran                    68.402973
100% Natural Bran            33.983679
All-Bran                     59.425505
All-Bran with Extra Fiber    93.704912
Almond Delight               34.384843
                               ...    
Triples                      39.106174
Trix                         27.753301
Wheat Chex                   49.787445
Wheaties                     51.592193
Wheaties Honey Gold          36.187559
Name: rating, Length: 74, dtype: float64

Now, add 5 to this selection.

In [26]:
cereal_df[cereal_df["type"] == 'C']["rating"] + 5

name
100% Bran                    73.402973
100% Natural Bran            38.983679
All-Bran                     64.425505
All-Bran with Extra Fiber    98.704912
Almond Delight               39.384843
                               ...    
Triples                      44.106174
Trix                         32.753301
Wheat Chex                   54.787445
Wheaties                     56.592193
Wheaties Honey Gold          41.187559
Name: rating, Length: 74, dtype: float64

Assign this modification to our original selection:

In [27]:
cereal_df[cereal_df["type"] == 'C']["rating"] = cereal_df[cereal_df["type"] == 'C']["rating"] + 5 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cereal_df[cereal_df["type"] == 'C']["rating"] = cereal_df[cereal_df["type"] == 'C']["rating"] + 5


Uh-oh. A warning was issued. Let's see what our assignment did:

In [28]:
cereal_df[cereal_df["type"] == 'C']["rating"]

name
100% Bran                    68.402973
100% Natural Bran            33.983679
All-Bran                     59.425505
All-Bran with Extra Fiber    93.704912
Almond Delight               34.384843
                               ...    
Triples                      39.106174
Trix                         27.753301
Wheat Chex                   49.787445
Wheaties                     51.592193
Wheaties Honey Gold          36.187559
Name: rating, Length: 74, dtype: float64

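No change was made to the original dataframe: the assignment went to a temporary copy produced by the first selection.

The .loc[] accessor selects rows and columns in a single step on the original dataframe, so the assignment sticks. Thus: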

In [24]:
cereal_df.loc[cereal_df["type"] == 'C', "rating"] += 5
cereal_df.loc[cereal_df["type"] == 'C', "rating"]

name
100% Bran                    73.402973
100% Natural Bran            38.983679
All-Bran                     64.425505
All-Bran with Extra Fiber    98.704912
Almond Delight               39.384843
                               ...    
Triples                      44.106174
Trix                         32.753301
Wheat Chex                   54.787445
Wheaties                     56.592193
Wheaties Honey Gold          41.187559
Name: rating, Length: 74, dtype: float64
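
The same pattern on a toy frame, as a minimal sketch (made-up ratings; `toy_df` and `is_cold` are illustrative names): build the Boolean mask once, then pass it together with the column label to `.loc[]`.

```python
import pandas as pd

# Toy frame with made-up ratings
toy_df = pd.DataFrame({'type': ['C', 'H', 'C'], 'rating': [60.0, 50.0, 30.0]})

# Boolean mask selecting the cold cereals
is_cold = toy_df['type'] == 'C'

# One .loc call with (row selector, column label) modifies toy_df directly
toy_df.loc[is_cold, 'rating'] += 5
print(toy_df)
```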

#### Datetime indices
- Pandas supports datetime types
- Series/DataFrame index: special operations/functionality for datetimes

Load the MTA turnstile maintenance dataset to see pandas datetimes in action!

In [29]:
turnstile_df = pd.read_csv('Data/turnstile_180901.txt')
turnstile_df.head(2)

,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,00:00:00,REGULAR,6736067,2283184
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,08/25/2018,04:00:00,REGULAR,6736087,2283188


In [26]:
turnstile_df['DATE']

0         08/25/2018
1         08/25/2018
2         08/25/2018
3         08/25/2018
4         08/25/2018
             ...    
197620    08/31/2018
197621    08/31/2018
197622    08/31/2018
197623    08/31/2018
197624    08/31/2018
Name: DATE, Length: 197625, dtype: object

In [27]:
turnstile_df['TIME']

0         00:00:00
1         04:00:00
2         08:00:00
3         12:00:00
4         16:00:00
            ...   
197620    05:00:00
197621    09:00:00
197622    13:00:00
197623    17:00:00
197624    21:00:00
Name: TIME, Length: 197625, dtype: object

Both are stored as strings (dtype object).

- Join date and time.
- Assign to new column.

In [30]:
turnstile_df['DATETIME'] = turnstile_df['DATE'] + ' ' + turnstile_df['TIME']
turnstile_df.drop(columns = ['DATE', 'TIME'], inplace = True)
turnstile_df['DATETIME'].head()

0    08/25/2018 00:00:00
1    08/25/2018 04:00:00
2    08/25/2018 08:00:00
3    08/25/2018 12:00:00
4    08/25/2018 16:00:00
Name: DATETIME, dtype: object

In [32]:
turnstile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197625 entries, 0 to 197624
Data columns (total 10 columns):
 #   Column                                                                Non-Null Count   Dtype 
---  ------                                                                --------------   ----- 
 0   C/A                                                                   197625 non-null  object
 1   UNIT                                                                  197625 non-null  object
 2   SCP                                                                   197625 non-null  object
 3   STATION                                                               197625 non-null  object
 4   LINENAME                                                              197625 non-null  object
 5   DIVISION                                                              197625 non-null  object
 6   DESC                                                                  197625 non-null  object

- Convert the strings to a datetime type.
- pd.to_datetime() can parse many common datetime string formats automatically.
- Our strings follow the %m/%d/%Y %H:%M:%S pattern (month/day/year, then the time).

In [33]:
turnstile_df['DATETIME'] = pd.to_datetime(turnstile_df['DATETIME'])
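
As a hedged sketch, the format can also be passed explicitly through the `format` argument of `pd.to_datetime`, which removes any ambiguity between month-first and day-first strings. The two sample strings below mimic the turnstile data; `stamps` and `parsed` are illustrative names.

```python
import pandas as pd

# Sample strings in the same layout as the DATETIME column
stamps = pd.Series(['08/25/2018 00:00:00', '08/31/2018 21:00:00'])

# Spell out the layout: month/day/year, then hours:minutes:seconds
parsed = pd.to_datetime(stamps, format='%m/%d/%Y %H:%M:%S')
print(parsed)
```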

In [34]:
turnstile_df.head()

,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS,DATETIME
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736067,2283184,2018-08-25 00:00:00
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736087,2283188,2018-08-25 04:00:00
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736105,2283229,2018-08-25 08:00:00
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736180,2283314,2018-08-25 12:00:00
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736349,2283384,2018-08-25 16:00:00


In [35]:
turnstile_df['DATETIME']

0        2018-08-25 00:00:00
1        2018-08-25 04:00:00
2        2018-08-25 08:00:00
3        2018-08-25 12:00:00
4        2018-08-25 16:00:00
                 ...        
197620   2018-08-31 05:00:00
197621   2018-08-31 09:00:00
197622   2018-08-31 13:00:00
197623   2018-08-31 17:00:00
197624   2018-08-31 21:00:00
Name: DATETIME, Length: 197625, dtype: datetime64[ns]

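This is a datetime series. Datetime series have very useful vectorized methods and attributes under the .dt accessor.
- Round each timestamp to the nearest 7-day boundary. The boundaries are counted from 1970-01-01 (a Thursday), so they fall on Thursdays, not on calendar week starts.
- Get the named day of the week for each date.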

In [32]:
turnstile_df['DATETIME'].dt.round('7D')

0        2018-08-23
1        2018-08-23
2        2018-08-23
3        2018-08-23
4        2018-08-23
            ...    
197620   2018-08-30
197621   2018-08-30
197622   2018-08-30
197623   2018-08-30
197624   2018-08-30
Name: DATETIME, Length: 197625, dtype: datetime64[ns]

In [36]:
turnstile_df['DATETIME'].dt.day_name()

0         Saturday
1         Saturday
2         Saturday
3         Saturday
4         Saturday
            ...   
197620      Friday
197621      Friday
197622      Friday
197623      Friday
197624      Friday
Name: DATETIME, Length: 197625, dtype: object
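
If the goal is the Monday that starts each timestamp's calendar week, one possible approach (a sketch, not the notebook's method) is to strip the time of day and subtract the day-of-week offset. The two sample timestamps and the names `stamps` and `week_start` are illustrative.

```python
import pandas as pd

# Two sample timestamps: a Saturday and a Friday
stamps = pd.Series(pd.to_datetime(['2018-08-25 04:00:00', '2018-08-31 21:00:00']))

# normalize() sets the time to midnight; dayofweek counts Monday as 0,
# so subtracting that many days lands on the Monday of the same week
week_start = stamps.dt.normalize() - pd.to_timedelta(stamps.dt.dayofweek, unit='D')
print(week_start)
print(week_start.dt.day_name().tolist())
```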

Set the DATETIME column as our index.

In [37]:
turnstile_df.set_index('DATETIME', inplace = True)

We now have a datetime index.

In [43]:
turnstile_df['2018-08-25':'2018-08-27']

DATETIME,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DESC,ENTRIES,EXITS
2018-08-25 00:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736067,2283184
2018-08-25 04:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736087,2283188
2018-08-25 08:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736105,2283229
2018-08-25 12:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736180,2283314
2018-08-25 16:00:00,A002,R051,02-00-00,59 ST,NQR456W,BMT,REGULAR,6736349,2283384
...,...,...,...,...,...,...,...,...,...
2018-08-27 05:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,REGULAR,5554,348
2018-08-27 09:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,REGULAR,5554,348
2018-08-27 13:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,REGULAR,5554,348
2018-08-27 17:00:00,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,REGULAR,5554,348
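
Date-string slicing on a datetime index can be sketched on a toy frame as follows (made-up counts; `toy_df` and `window` are illustrative names). The index is sorted first, and `.loc[]` is used to make the row selection explicit. Note that the end date is inclusive and covers the whole day.

```python
import pandas as pd

# Toy frame with made-up counts and a datetime index
stamps = pd.to_datetime([
    '2018-08-25 00:00:00', '2018-08-25 12:00:00', '2018-08-26 08:00:00',
    '2018-08-27 17:00:00', '2018-08-28 04:00:00',
])
toy_df = pd.DataFrame({'ENTRIES': [10, 25, 40, 70, 90]}, index=stamps).sort_index()

# Both endpoints are included; '2018-08-27' matches every time on that day
window = toy_df.loc['2018-08-25':'2018-08-27']
print(window)
```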


Pandas datetime indexes: nice functionality for transforming time series.

- Will get into this later.