<h1><center>DS 200 - Lec8 Pandas I/O and Operations</center></h1>

## Section 1: Data Input and Output

This notebook is reference code for data input and output. pandas can read a variety of file types with its `pd.read_*` functions. Let's take a look at the most common file types:

In [1]:
import numpy as np
import pandas as pd

### CSV File Type

#### CSV Input

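Jupyter can run shell commands on the host machine: prefix the command with `!`.

If you are using Linux or macOS, use the following (on Windows, `wget` is usually not available, which produces the error shown below):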

In [2]:
!wget https://raw.githubusercontent.com/BlueJayADAL/DS200/master/datasets/stock_data.csv

'wget' is not recognized as an internal or external command,
operable program or batch file.


If you are using Windows, use the following:

In [3]:
!curl https://raw.githubusercontent.com/BlueJayADAL/DS200/master/datasets/stock_data.csv -o stock_data.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   178  100   178    0     0   1769      0 --:--:-- --:--:-- --:--:--  1780


Use `read_csv()` to read the local file in.

In [4]:
df = pd.read_csv('stock_data.csv')



# Display df
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata


#### CSV Output

Now write the DataFrame `df` to a CSV file called `example.csv`. Passing `index = False` keeps the row index out of the file.

In [5]:
df.to_csv('example.csv', index = False)
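
To see what `index = False` changes, here is a minimal sketch on a small made-up frame (the two-row data is an assumption for illustration, not the course dataset). Calling `to_csv()` without a path returns the CSV text, which makes the difference easy to inspect:

```python
import io
import pandas as pd

# Small illustrative frame (not the course dataset)
df_small = pd.DataFrame({"tickers": ["GOOGL", "WMT"], "price": [845, 65]})

# Default: the row index is written as an extra, unnamed first column
with_index = df_small.to_csv()
# index=False: only the data columns are written
without_index = df_small.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the first version back yields a stray "Unnamed: 0" column
reread = pd.read_csv(io.StringIO(with_index))
print(reread.columns.tolist())
```

If you do want to keep the index in the file, reading it back with `index_col = 0` restores it instead of creating the extra column.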



#### Use data URL directly.

__Read CSV file__

In [6]:
dataurl = 'https://raw.githubusercontent.com/BlueJayADAL/DS200/master/datasets/stock_data.csv'


df = pd.read_csv(dataurl)


# Display df
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata


__Read CSV file and skip number of rows = 1__

In [7]:
df = pd.read_csv(dataurl, skiprows = 1)



# Display df
df

Unnamed: 0,GOOGL,27.82,87,845,larry page
0,WMT,4.61,484,65,n.a.
1,MSFT,-1,85,64,bill gates
2,RIL,not available,50,1023,mukesh ambani
3,TATA,5.6,-1,n.a.,ratan tata


__Read CSV file and take row index = 1 as the header (column names). Default header='infer'__

In [8]:
df = pd.read_csv(dataurl, header = 1)

df

Unnamed: 0,GOOGL,27.82,87,845,larry page
0,WMT,4.61,484,65,n.a.
1,MSFT,-1,85,64,bill gates
2,RIL,not available,50,1023,mukesh ambani
3,TATA,5.6,-1,n.a.,ratan tata
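
On this file `skiprows = 1` and `header = 1` give the same result: either way the original header line is dropped and the GOOGL row is promoted to column names. A small sketch with inline text (a shortened, assumed copy of the data) shows the equivalence:

```python
import io
import pandas as pd

# Shortened inline copy of the data, used only for illustration
csv_text = """tickers,eps,revenue
GOOGL,27.82,87
WMT,4.61,484
MSFT,-1,85
"""

# Skip the first line, then infer the header from the next line
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=1)
# Use the line at row index 1 as the header; lines before it are discarded
promoted = pd.read_csv(io.StringIO(csv_text), header=1)

print(skipped.columns.tolist())
print(skipped.equals(promoted))
```

Both options are mainly useful when a file starts with junk lines above the real header.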


__Read CSV file and use the first column (column index 0, `tickers`) as the row index instead of the default numeric index__

In [9]:
df = pd.read_csv(dataurl, index_col = 0)



df

Unnamed: 0_level_0,eps,revenue,price,people
tickers,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
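
A related case is a file that has no header row at all. A minimal sketch, assuming headerless text and column names chosen here for illustration: pass `header = None` so no data row is consumed as the header, and supply the names yourself with `names`.

```python
import io
import pandas as pd

# Hypothetical headerless data, mirroring the first rows of the stock file
raw = "GOOGL,27.82,87\nWMT,4.61,484\n"
cols = ["tickers", "eps", "revenue"]  # names chosen for illustration

# header=None: treat every line as data; names: provide the column names
df_named = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# index_col can refer to one of the provided names
df_indexed = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                         index_col="tickers")

print(df_named)
print(df_indexed)
```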


__Only read the first 2 rows from the CSV file__

In [10]:
df = pd.read_csv(dataurl, nrows = 2)


df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.


__Read CSV file and provide the possible representations of NaN values in the data__

In [11]:
df = pd.read_csv(dataurl, na_values= ['not available', 'n.a.'])


df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845.0,larry page
1,WMT,4.61,484,65.0,
2,MSFT,-1.0,85,64.0,bill gates
3,RIL,,50,1023.0,mukesh ambani
4,TATA,5.6,-1,,ratan tata


__(Optional) Read CSV file and provide the possible representations of NaN values for each column respectively.__

In [12]:
df = pd.read_csv(dataurl,  na_values={
        'eps': ['not available'],
        'revenue': [-1],
        'people': ['not available','n.a.']
    })
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845,larry page
1,WMT,4.61,484.0,65,
2,MSFT,-1.0,85.0,64,bill gates
3,RIL,,50.0,1023,mukesh ambani
4,TATA,5.6,,n.a.,ratan tata
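
Note that `price` still contains the text `n.a.` in the output above, because the dictionary has no entry for that column. The sketch below works on an inline copy of the data (so it runs without network access) and adds a `price` entry, which is an addition for illustration; it then checks how the column types change once the placeholders are recognized.

```python
import io
import pandas as pd

# Inline copy of the stock data shown above
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

cleaned = pd.read_csv(io.StringIO(csv_text), na_values={
    "eps": ["not available"],
    "revenue": [-1],
    "price": ["n.a."],            # added here so price becomes numeric
    "people": ["not available", "n.a."],
})

print(cleaned.dtypes)        # numeric columns are parsed as floats
print(cleaned.isna().sum())  # one missing value in each of the four columns
```

The per-column form matters for `-1`: it is treated as missing in `revenue` but kept as a real value in `eps`.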


__Write to a pickle file and read it back__

In [13]:
df.to_pickle('temp_df.pkl')



In [14]:
temp_df = pd.read_pickle('temp_df.pkl')


# Display
temp_df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845,larry page
1,WMT,4.61,484.0,65,
2,MSFT,-1.0,85.0,64,bill gates
3,RIL,,50.0,1023,mukesh ambani
4,TATA,5.6,,n.a.,ratan tata
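
One reason to prefer pickle over CSV for intermediate results is that column types survive the round trip. A minimal sketch, using a made-up two-row frame and temporary files (both are assumptions for illustration):

```python
import os
import tempfile
import pandas as pd

# Illustrative frame with a datetime column
frame = pd.DataFrame({
    "day": pd.to_datetime(["2017-01-01", "2017-01-04"]),
    "temperature": [32.0, None],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "frame.pkl")
    csv_path = os.path.join(tmp, "frame.csv")
    frame.to_pickle(pkl_path)
    frame.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes)  # day is still a datetime column
print(from_csv.dtypes)     # day comes back as text unless parse_dates is used
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.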


____
### Excel Data Type
pandas can read and write Excel files. Keep in mind that `read_excel` imports only the data, not formulas or images; workbooks that contain images or macros may cause `read_excel` to fail.

#### Excel Input

In [18]:
excel_url = 'https://github.com/BlueJayADAL/DS200/blob/master/datasets/stock_data.xlsx?raw=true'

In [19]:
df = pd.read_excel(excel_url, sheet_name = 'Sheet1')
df


Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata


#### Excel Output
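
A minimal sketch of writing a DataFrame to an Excel file with `to_excel()`. The file name and the small frame are assumptions for illustration, and writing `.xlsx` files needs an Excel engine such as `openpyxl` to be installed, so the sketch checks for it first.

```python
import importlib.util
import os
import tempfile
import pandas as pd

df_out = pd.DataFrame({"tickers": ["GOOGL", "WMT"], "price": [845, 65]})

# Writing .xlsx requires an engine such as openpyxl
has_engine = importlib.util.find_spec("openpyxl") is not None
round_trip = None

if has_engine:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "excel_output.xlsx")  # hypothetical file name
        # index=False keeps the row index out of the sheet
        df_out.to_excel(path, sheet_name="Sheet1", index=False)
        round_trip = pd.read_excel(path, sheet_name="Sheet1")
    print(round_trip)
else:
    print("openpyxl is not installed; skipping the Excel round trip")
```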

### Pandas I/O Exercises

<font color="purple">Given the following data URL:</font>

In [20]:
url = 'https://raw.githubusercontent.com/BlueJayADAL/DS200/master/datasets/weather_data.csv'

<font color="purple">Read in the data into a DataFrame named `df`:</font>

In [21]:
df = pd.read_csv(url)



In [22]:
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


<font color="purple">Show the first 6 rows, the last 7 rows, and then the data types of all the columns: </font>

In [23]:
df.head(n = 6)



Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny


In [24]:
df.tail(n = 7)

Unnamed: 0,day,temperature,windspeed,event
2,1/5/2017,28.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,40.0,12.0,Sunny


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          9 non-null      object 
 1   temperature  5 non-null      float64
 2   windspeed    5 non-null      float64
 3   event        7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes


In [26]:
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [27]:
df.index

RangeIndex(start=0, stop=9, step=1)

<font color="purple">Now, try to read in the same dataset but parse the `day` column as `Timestamp` type with the `parse_dates` parameter. </font>

In [45]:
df = pd.read_csv(url, parse_dates = ['day'])



In [46]:
df # note how timestamp is printed this time

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,9.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,,,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


<font color="purple">Show the data types for all columns again (for example with `df.info()`). Note that `day` is now a datetime column.</font>

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   day          9 non-null      datetime64[ns]
 1   temperature  5 non-null      float64       
 2   windspeed    5 non-null      float64       
 3   event        7 non-null      object        
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 416.0+ bytes
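
The effect of `parse_dates` can be reproduced offline with a few inline rows (a shortened, assumed copy of the weather data). Once the column is a datetime type, the `.dt` accessor becomes available:

```python
import io
import pandas as pd

# Shortened inline copy of the weather data
weather_csv = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/4/2017,,9,Sunny
1/5/2017,28,,Snow
"""

as_text = pd.read_csv(io.StringIO(weather_csv))
as_dates = pd.read_csv(io.StringIO(weather_csv), parse_dates=["day"])

print(as_text["day"].dtype)    # plain text column
print(as_dates["day"].dtype)   # datetime column

# Datetime columns support date-aware operations via .dt
print(as_dates["day"].dt.day_name().tolist())
```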


<font color="purple">Now, use `set_index()` function to set the index of the DataFrame with the `day` column instead of using the default numeric index.</font>

In [47]:
# Change the default numeric index to column 'day'
df.set_index(keys ='day', inplace = True)



df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
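
As an alternative to `inplace = True`, `set_index()` can return a new DataFrame and leave the original untouched. A minimal sketch on a made-up three-row frame (an assumption for illustration), which also shows that a datetime index allows selecting rows by date:

```python
import pandas as pd

weather = pd.DataFrame({
    "day": pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
    "temperature": [32.0, None, 28.0],
    "event": ["Rain", "Sunny", "Snow"],
})

# Returns a new frame; `weather` keeps its numeric index and `day` column
indexed = weather.set_index("day")

# With a datetime index, rows can be selected by date range (both ends included)
print(indexed.loc["2017-01-04":"2017-01-05"])

# reset_index() turns the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())
```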


<font color="purple">Fill all NaN values with one specific value, e.g. 0. Note that 0 is not always the best guess, and it is also written into the text column `event`.</font>

In [48]:
new_df = df.fillna(0)



new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,0
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,0
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


<font color="purple">Fill the NaNs with the given dictionary `fill_values`: </font>

```
{
    'temperature': 0,
    'windspeed': 0,
    'event': 'No Event'
}
```

In [50]:
# Named fill_values rather than dict, so the built-in dict type is not shadowed
fill_values = {
        'temperature': 0,
        'windspeed': 0,
        'event': 'No Event'
       }

In [51]:
new_df = df.fillna(fill_values)



new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,No Event
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,No Event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
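
The per-column dictionary does not have to hold constants. As one possible sketch (the choice of mean and median is an assumption for illustration, not a recommendation from the lecture), the fill values can be computed from the columns themselves:

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps in every column
weather = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan],
    "windspeed": [6.0, 9.0, np.nan, 7.0],
    "event": ["Rain", "Sunny", "Snow", None],
})

per_column = {
    "temperature": weather["temperature"].mean(),   # mean of 32 and 28
    "windspeed": weather["windspeed"].median(),     # median of 6, 9 and 7
    "event": "No Event",
}

filled = weather.fillna(per_column)
print(filled)
```

Whether a mean, a median, or a neighboring value is appropriate depends on the data; the forward and backward fills below are another option for ordered data.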


<font color="purple">Propagate NaN values with forward fill</font>

In [52]:
#  propagate NaN with forward fill.
new_df = df.fillna(method='ffill')



new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,7.0,Sunny
2017-01-09,32.0,7.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


<font color="purple">Propagate NaN values with backward fill</font>

In [53]:
#  propagate NaN with backward fill.
new_df = df.fillna(method='bfill')



new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,28.0,9.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,32.0,7.0,Rain
2017-01-07,32.0,8.0,Rain
2017-01-08,34.0,8.0,Sunny
2017-01-09,34.0,8.0,Cloudy
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


<font color="purple">Propagate NaN values with backward fill, but on axis = 1</font>

In [54]:
new_df = df.fillna(method='bfill', axis = 1)



new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,9.0,9.0,Sunny
2017-01-05,28.0,Snow,Snow
2017-01-06,7.0,7.0,
2017-01-07,32.0,Rain,Rain
2017-01-08,Sunny,Sunny,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


<font color="purple">Propagate NaN values with forward fill and limit the filling distance to 1.</font>

In [55]:
new_df = df.fillna(method='ffill', limit = 1)



new_df

Unnamed: 0_level_0,temperature,windspeed,event
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,,Sunny
2017-01-09,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
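
Recent pandas releases deprecate the `method` argument of `fillna()`; the dedicated `ffill()` and `bfill()` methods do the same job. A minimal sketch on a made-up column (an assumption for illustration):

```python
import numpy as np
import pandas as pd

temps = pd.DataFrame({"temperature": [32.0, np.nan, np.nan, 34.0]})

forward = temps.ffill()          # carry the last valid value forward
backward = temps.bfill()         # pull the next valid value backward
limited = temps.ffill(limit=1)   # fill at most one consecutive NaN

print(forward["temperature"].tolist())
print(backward["temperature"].tolist())
print(limited["temperature"].tolist())
```

With `limit = 1` the second NaN in the run stays missing, which matches the behavior shown in the table above.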


## Section 2: Pandas Operations

Many pandas operations are really useful but don't fall into any distinct category. Let's show them here in this lecture:

In [56]:
df1 = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df1

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [57]:
df2 = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':[5, 6, 7, 8]})
df2

Unnamed: 0,col1,col2,col3
0,1,444,5
1,2,555,6
2,3,666,7
3,4,444,8


### Info on Unique Values

In [58]:
df1['col2'].unique()



array([444, 555, 666], dtype=int64)

In [59]:
df1['col2'].nunique()



3

In [60]:
df1['col2'].value_counts()



444    2
555    1
666    1
Name: col2, dtype: int64
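
The three calls answer slightly different questions: which values occur, how many distinct values there are, and how often each one occurs. A compact sketch on the same `col2` values; the `normalize = True` option is an addition here that reports shares instead of counts:

```python
import pandas as pd

col2 = pd.Series([444, 555, 666, 444], name="col2")

distinct = col2.unique()       # distinct values, in order of first appearance
n_distinct = col2.nunique()    # number of distinct values
counts = col2.value_counts()   # occurrences per value, most frequent first
shares = col2.value_counts(normalize=True)  # same, as fractions of the total

print(distinct, n_distinct)
print(counts)
print(shares)
```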

### Data Transformation: `apply()` and `applymap()` Functions

In [61]:
def times2(x):
    return x*2

In [62]:
def max_minus_min(x): 
    return max(x)-min(x)


#### Transform df1 and df2 with times2() function

In [67]:
df1.apply(times2)



Unnamed: 0,col1,col2,col3
0,2,888,abcabc
1,4,1110,defdef
2,6,1332,ghighi
3,8,888,xyzxyz


In [71]:
df2.apply(times2)



Unnamed: 0,col1,col2,col3
0,2,888,10
1,4,1110,12
2,6,1332,14
3,8,888,16


#### Transform df1 and df2 with max_minus_min() function

In [72]:
df1.apply(max_minus_min)



TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [73]:
df2.apply(max_minus_min)



col1      3
col2    222
col3      3
dtype: int64

#### Transform df2 with `times2()` and `max_minus_min()` using `applymap()`

In [74]:
df2.applymap(times2)



Unnamed: 0,col1,col2,col3
0,2,888,10
1,4,1110,12
2,6,1332,14
3,8,888,16


In [75]:
df2.applymap(max_minus_min)



TypeError: 'int' object is not iterable
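
The two errors above come from what each method hands to the function. `apply()` passes a whole column (a Series), so `max_minus_min` fails on the string column of `df1`; `applymap()` passes one cell at a time, so `max()` is called on a single integer. Recent pandas releases also deprecate `applymap()` in favor of `DataFrame.map()`. A minimal sketch of one way around the first error, assuming it is acceptable to restrict the calculation to numeric columns:

```python
import pandas as pd

def max_minus_min(x):
    return max(x) - min(x)

df1 = pd.DataFrame({"col1": [1, 2, 3, 4],
                    "col2": [444, 555, 666, 444],
                    "col3": ["abc", "def", "ghi", "xyz"]})

# Keep only the numeric columns before aggregating
numeric = df1.select_dtypes(include="number")

col_ranges = numeric.apply(max_minus_min)          # one value per column
row_ranges = numeric.apply(max_minus_min, axis=1)  # one value per row

# Series.apply works element by element, like applymap on a single column
doubled = df1["col3"].apply(lambda s: s * 2)

print(col_ranges)
print(row_ranges)
print(doubled)
```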

__Get column and index names (via the `.columns` and `.index` attributes):__

Index(['col1', 'col2', 'col3'], dtype='object')

RangeIndex(start=0, stop=4, step=1)

__Sorting and Ordering a DataFrame:__

In [49]:
df1

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


Unnamed: 0,col1,col2,col3
0,1,444,abc
3,4,444,xyz
1,2,555,def
2,3,666,ghi
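
The second table above is `df1` ordered by `col2`. The call that produced it is not shown, so the following is a sketch assuming `sort_values()` was used; the two-column sort is an addition for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4],
                    "col2": [444, 555, 666, 444],
                    "col3": ["abc", "def", "ghi", "xyz"]})

# Sort rows by col2 (ascending by default); the original index labels are kept
by_col2 = df1.sort_values(by="col2")

# Sort by col2 ascending, breaking ties by col1 descending
by_both = df1.sort_values(by=["col2", "col1"], ascending=[True, False])

# reset_index(drop=True) renumbers the rows after sorting
renumbered = by_col2.reset_index(drop=True)

print(by_col2)
print(by_both)
```

`sort_values()` returns a new DataFrame, so `df1` itself keeps its original order.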


__Find Null Values or Check for Null Values:__

In [77]:
df1.isnull().sum()



col1    0
col2    0
col3    0
dtype: int64

# Great Job!