### Pandas - Functions

In [1]:
# importing the libraries
import numpy as np
import pandas as pd

First, we have to give Google Colab access to our Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Once we have access, we can load files from Google Drive using the read_csv() function.

In [4]:
path = "../data/StockData.csv"
data = pd.read_csv(path)

In [5]:
# head() shows the first 5 rows of the data by default
data.head()

Unnamed: 0,stock,date,price
0,AAPL,08-02-2013,67.8542
1,AAPL,11-02-2013,68.5614
2,AAPL,12-02-2013,66.8428
3,AAPL,13-02-2013,66.7156
4,AAPL,14-02-2013,66.6556


**Group By function**
* The pandas DataFrame.groupby() function splits the data into groups based on some criteria, so that a calculation can be run on each group separately.

In [6]:
data.groupby(['stock'])['price'].mean()

stock
AAPL    109.066698
SNI      71.319206
TJX      66.743566
ZTS      45.098648
Name: price, dtype: float64

* Here the groupby function splits the data into the 4 stocks present in the dataset, and then the mean price of each of the 4 stocks is calculated.

In [7]:
# similarly we can get the median price of each stock
data.groupby(['stock'])['price'].median()

stock
AAPL    109.01
SNI      72.31
TJX      68.85
ZTS      45.62
Name: price, dtype: float64

* Here the groupby function splits the data into the 4 stocks present in the dataset, and then the median price of each of the 4 stocks is calculated.
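The mean and the median can also be computed in a single pass over the groups with agg(). The following is a minimal sketch on a small made-up frame (the `sample` data is an assumption for illustration, not the stock dataset loaded above):

```python
import pandas as pd

# Small illustrative frame that reuses the 'stock' and 'price' column names
sample = pd.DataFrame({
    "stock": ["AAPL", "AAPL", "ZTS", "ZTS"],
    "price": [10.0, 20.0, 30.0, 50.0],
})

# agg() accepts a list of statistic names and returns one column per statistic
summary = sample.groupby("stock")["price"].agg(["mean", "median"])
print(summary)
```

The result is a DataFrame indexed by stock, with one row per group.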

**Let's create a function to increase the price of the stock by 10%**

In [8]:
def profit(s):
    return s + s*0.10 # increase of 10%

**The pandas apply() function lets you apply a function to the rows or columns of a DataFrame, or to each value of a Series (as done here with the price column).**

In [9]:
data['price'].apply(profit)

0       74.63962
1       75.41754
2       73.52708
3       73.38716
4       73.32116
          ...   
5031    85.60200
5032    84.45800
5033    81.21300
5034    80.59700
5035    81.24600
Name: price, Length: 5036, dtype: float64
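For simple arithmetic like this, the same result can usually be obtained without apply() by operating on the whole column at once. The sketch below compares the two approaches on a few made-up prices (the `prices` Series is an assumption for illustration):

```python
import pandas as pd

def profit(s):
    return s + s * 0.10  # increase of 10%

# Illustrative prices, not taken from the stock dataset
prices = pd.Series([100.0, 200.0, 50.0], name="price")

# apply() calls profit once per value
applied = prices.apply(profit)

# Arithmetic on the Series itself works on all values in one step
vectorized = prices * 1.10

# The two results agree up to floating-point rounding
print((applied - vectorized).abs().max())
```

On large columns the vectorized form is typically faster, while apply() remains useful when the logic cannot be written as column arithmetic.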

* We can now add these updated values to the dataset as a new column.

In [10]:
data['new_price'] = data['price'].apply(profit)
data.head()

Unnamed: 0,stock,date,price,new_price
0,AAPL,08-02-2013,67.8542,74.63962
1,AAPL,11-02-2013,68.5614,75.41754
2,AAPL,12-02-2013,66.8428,73.52708
3,AAPL,13-02-2013,66.7156,73.38716
4,AAPL,14-02-2013,66.6556,73.32116


**The pandas sort_values() function sorts a DataFrame in ascending or descending order of the passed column.**

In [11]:
data.sort_values(by='new_price',ascending=False) # by default ascending is set to True

Unnamed: 0,stock,date,price,new_price
1244,AAPL,18-01-2018,179.26,197.186
1243,AAPL,17-01-2018,179.10,197.010
1245,AAPL,19-01-2018,178.46,196.306
1241,AAPL,12-01-2018,177.09,194.799
1247,AAPL,23-01-2018,177.04,194.744
...,...,...,...,...
4076,ZTS,17-04-2014,28.60,31.460
4074,ZTS,15-04-2014,28.55,31.405
4075,ZTS,16-04-2014,28.53,31.383
4073,ZTS,14-04-2014,28.48,31.328
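sort_values() also accepts a list of columns, with a separate direction for each, and it returns a new DataFrame rather than changing the original. A minimal sketch on made-up data (the `sample` frame is an assumption for illustration):

```python
import pandas as pd

# Illustrative frame that reuses the 'stock' and 'price' column names
sample = pd.DataFrame({
    "stock": ["ZTS", "AAPL", "ZTS", "AAPL"],
    "price": [30.0, 10.0, 50.0, 20.0],
})

# Sort by stock name (A to Z), then by price within each stock (high to low)
ordered = sample.sort_values(by=["stock", "price"], ascending=[True, False])
print(ordered)

# The original frame keeps its row order unless the result is assigned back
print(sample)
```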


### Pandas - Date-time Functions

In [12]:
# checking the data type of columns in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5036 entries, 0 to 5035
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   stock      5036 non-null   object 
 1   date       5036 non-null   object 
 2   price      5036 non-null   float64
 3   new_price  5036 non-null   float64
dtypes: float64(2), object(2)
memory usage: 157.5+ KB


* We observe that the date column has the object dtype, whereas it should have a datetime dtype.

In [13]:
# converting the date column to datetime format
# dayfirst=True because the dates are stored as DD-MM-YYYY
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5036 entries, 0 to 5035
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   stock      5036 non-null   object        
 1   date       5036 non-null   datetime64[ns]
 2   price      5036 non-null   float64       
 3   new_price  5036 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 157.5+ KB


* We observe that the date column has been converted to datetime format.
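The day-first setting matters because a string such as 08-02-2013 is ambiguous: it can be read as 8 February or as 2 August. The sketch below parses two date strings from the head of the dataset in three ways. Passing an explicit `format` is shown as an alternative to `dayfirst=True`, not as the method used in this notebook:

```python
import pandas as pd

# Two date strings in the same DD-MM-YYYY layout as the dataset
raw = pd.Series(["08-02-2013", "11-02-2013"])

# Day-first parsing, as used above
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format string states the layout unambiguously
explicit = pd.to_datetime(raw, format="%d-%m-%Y")

# Reading the same strings as MM-DD-YYYY silently gives different dates
month_first = pd.to_datetime(raw, format="%m-%d-%Y")

print(day_first.dt.month.tolist())    # months under day-first parsing
print(month_first.dt.month.tolist())  # months under month-first parsing
```

Both strings are valid under either reading, so no error is raised for the wrong one. Checking a few parsed values against the raw file is a cheap safeguard.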

In [15]:
data.head()

Unnamed: 0,stock,date,price,new_price
0,AAPL,2013-02-08,67.8542,74.63962
1,AAPL,2013-02-11,68.5614,75.41754
2,AAPL,2013-02-12,66.8428,73.52708
3,AAPL,2013-02-13,66.7156,73.38716
4,AAPL,2013-02-14,66.6556,73.32116


**The 'date' column is now in datetime format, so we can display the dates in any other format.**

In [16]:
data['date'].dt.strftime('%m/%d/%Y')

0       02/08/2013
1       02/11/2013
2       02/12/2013
3       02/13/2013
4       02/14/2013
           ...    
5031    02/01/2018
5032    02/02/2018
5033    02/05/2018
5034    02/06/2018
5035    02/07/2018
Name: date, Length: 5036, dtype: object

In [17]:
data['date'].dt.strftime('%m-%d-%y')

0       02-08-13
1       02-11-13
2       02-12-13
3       02-13-13
4       02-14-13
          ...   
5031    02-01-18
5032    02-02-18
5033    02-05-18
5034    02-06-18
5035    02-07-18
Name: date, Length: 5036, dtype: object
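Note that strftime() returns strings, not dates, which is why the outputs above no longer have a datetime dtype. A minimal sketch of the difference, using two made-up dates (the `dates` Series and the format string are assumptions for illustration):

```python
import pandas as pd

# Illustrative datetime Series
dates = pd.Series(pd.to_datetime(["2013-02-08", "2018-02-07"]))

# strftime() produces text in the requested layout
formatted = dates.dt.strftime("%Y/%m/%d")
print(formatted.tolist())

# The formatted values are plain Python strings
print(all(isinstance(value, str) for value in formatted))

# The original Series is untouched and still holds datetimes
print(pd.api.types.is_datetime64_any_dtype(dates))
```

For that reason it is usually safer to store formatted dates in a separate column and keep the datetime column for calculations.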

**Extracting year from the date column**

In [18]:
data['date'].dt.year

0       2013
1       2013
2       2013
3       2013
4       2013
        ... 
5031    2018
5032    2018
5033    2018
5034    2018
5035    2018
Name: date, Length: 5036, dtype: int32

Creating a new column and adding the extracted year values into the dataframe.

In [19]:
data['year'] = data['date'].dt.year

**Extracting month from the date column**

In [20]:
data['date'].dt.month

0       2
1       2
2       2
3       2
4       2
       ..
5031    2
5032    2
5033    2
5034    2
5035    2
Name: date, Length: 5036, dtype: int32

Creating a new column and adding the extracted month values into the dataframe.

In [21]:
data['month'] = data['date'].dt.month

**Extracting day from the date column**

In [22]:
data['date'].dt.day

0        8
1       11
2       12
3       13
4       14
        ..
5031     1
5032     2
5033     5
5034     6
5035     7
Name: date, Length: 5036, dtype: int32

Creating a new column and adding the extracted day values into the dataframe.

In [23]:
data['day'] = data['date'].dt.day

In [24]:
data.head()

Unnamed: 0,stock,date,price,new_price,year,month,day
0,AAPL,2013-02-08,67.8542,74.63962,2013,2,8
1,AAPL,2013-02-11,68.5614,75.41754,2013,2,11
2,AAPL,2013-02-12,66.8428,73.52708,2013,2,12
3,AAPL,2013-02-13,66.7156,73.38716,2013,2,13
4,AAPL,2013-02-14,66.6556,73.32116,2013,2,14


* We can see that the year, month, and day columns have been added to the dataset.

In [25]:
# The datetime format is convenient for many tasks!
data['date'][1]-data['date'][0]

Timedelta('3 days 00:00:00')
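Subtracting two datetimes gives a Timedelta, and the same idea extends to a whole column with diff(). The sketch below uses the first three dates from the head of the dataset. The variable names are assumptions for illustration:

```python
import pandas as pd

# First three dates shown in data.head()
dates = pd.Series(pd.to_datetime(["2013-02-08", "2013-02-11", "2013-02-12"]))

# Difference between two individual dates is a Timedelta
gap = dates[1] - dates[0]
print(gap.days)  # whole days in the gap, as an integer

# diff() subtracts each date from the next one; the first row has no
# predecessor, so its value is missing
gaps_in_days = dates.diff().dt.days
print(gaps_in_days)
```

The 3-day gap between the first two rows spans a weekend, when no prices are recorded.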