### <center> Pandas </center>

In [1]:
import pandas as pd  # pandas: library for tabular data
import numpy as np   # numpy: library for numerical computing

<div class="alert alert-info">
<b>Useful Information</b>


- __Delete rows or columns:__ `df.drop(name, axis=0, inplace=True)`, where `name` is a row index label or a column name. `axis=0` deletes rows; to delete columns instead, pass a column name or a list of column names together with `axis=1`.

    
- __Drop duplicates:__ `df.drop_duplicates()` returns a new DataFrame unless you pass `inplace=True`. To look for duplicates in specific columns only, pass the `subset` argument with a column name or a list of column names.

    
    
- __Insert a column at a given position:__ `df.insert(loc, column, value)`, where `loc` is the integer position, `column` is the new column's name, and `value` is the data to insert. By contrast, plain assignment (`df['new_col'] = value`) appends the column at the end.

    
    
- __Rename column:__ `df.rename(columns = {'old_name':'new_name'}, inplace=True)`</div>
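The four operations listed above can be tried on a small table. This is a minimal sketch on made-up data: the `toy` frame and its `city`, `temp`, and `row_id` columns are illustrative assumptions, not part of the ride dataset.

```python
import pandas as pd

# Toy data; all names and values are made up for illustration.
toy = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Lima"],
    "temp": [3, 3, 18, 22],
})

# Delete the row with index label 3 (axis=0 means rows).
no_last_row = toy.drop(3, axis=0)

# Delete a column by name (axis=1 means columns).
no_temp = toy.drop(["temp"], axis=1)

# Drop exact duplicate rows: rows 0 and 1 are identical, row 0 is kept.
unique_rows = toy.drop_duplicates()

# Insert a new column at position 0 (the leftmost place).
# insert() modifies the frame in place, so we work on a copy.
with_id = toy.copy()
with_id.insert(0, "row_id", [10, 11, 12, 13])

# Rename a column; without inplace=True a new DataFrame is returned.
renamed = toy.rename(columns={"temp": "temperature"})
```

None of these calls uses `inplace=True`, so `toy` itself stays unchanged and each result can be inspected separately.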


### <center> 🔍 Exploratory Data Analysis (EDA) </center>

<p id="2"></p>

In this part we have 3 sections:
- Load Data;
- Primary Overview;
- Analysis

#### `Load Data`

<p id="3"></p>
<div class="alert alert-info">

**[pd.read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)** - reads data from a `csv` file into a `DataFrame`.

In [2]:
class Paths:
    car_train = '../data/car_train.csv'
    car_test = '../data/car_test.csv'
    driver_info = '../data/driver_info.csv'
    fix_info = '../data/fix_info.csv'
    rides_info = '../data/rides_info.csv'
paths = Paths()

In [3]:
path = paths.rides_info
rides_info = pd.read_csv(path)
rides_info.head(10)

Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
0,o52317055h,A-1049127W,b1v,2020-01-01,4.95,21,268,36,113.548538,0,514.24692,0,1.11526,2.909
1,H41298704y,A-1049127W,T1U,2020-01-01,6.91,8,59,36,93.0,1,197.520662,0,1.650465,4.133
2,v88009926E,A-1049127W,g1p,2020-01-02,6.01,20,315,61,81.959675,0,1276.328206,0,2.599112,2.461
3,t14229455i,A-1049127W,S1c,2020-01-02,0.26,19,205,32,128.0,0,535.680831,0,3.216255,0.909
4,W17067612E,A-1049127W,X1b,2020-01-03,1.21,56,554,38,90.0,1,1729.143367,0,2.71655,-1.822
5,I45176130J,A-1049127W,j1v,2020-01-03,7.52,67,1068,28,36.0,2,363.209144,0,0.496265,-3.442
6,W11562554A,A-1049127W,A1g,2020-01-04,5.78,30,324,48,61.0,0,1314.257355,0,1.464346,-6.004
7,o13713369s,A-1049127W,B1n,2020-01-04,7.35,29,401,57,65.845512,0,1753.88842,0,0.497193,-6.474
8,y62286141d,A-1049127W,h1a,2020-01-05,0.12,64,893,38,114.0,1,2022.125012,0,-0.155147,-5.123
9,V28486769l,A-1049127W,p1e,2020-01-05,3.32,43,424,31,51.298365,1,1334.567248,0,-3.757628,-2.079


<div class="alert alert-info"><b>Useful Information</b>

- If a column named something like `Unnamed: 0` appears after reading a file, it is the index that was saved when the file was written. Pass `index_col=0` (the position of the column to use as the index) to load it as the index instead of getting a fresh default one.


- Useful row preview methods: **df.head(n)** / **df.sample(n)** / **df.tail(n)** return the first, random, or last `n` rows of the dataframe. A `DataFrame` is a two-dimensional data structure whose columns can have different types.


- Counting occurrences of each value: **df.value_counts()** returns the unique values and how often each occurs as a `Series`. By default the result is sorted in descending order of frequency (`ascending=False`); pass `ascending=True` to reverse it. A numeric column (a `Series`) can also be grouped into bins via the `bins` parameter.
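The `Unnamed: 0` behaviour and `value_counts()` can be reproduced without any file on disk. The sketch below uses an in-memory CSV string; the `toy` frame and its `fruit` and `price` columns are assumptions made up for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"fruit": ["apple", "pear", "apple"], "price": [1.0, 2.5, 1.2]})

# Writing with the default index=True stores the index as an unnamed first column.
csv_text = toy.to_csv()

# Reading it back naively turns that column into 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# value_counts() on a single column returns a Series sorted by frequency.
fruit_counts = restored["fruit"].value_counts()
```

Here `naive` has three columns (`Unnamed: 0`, `fruit`, `price`), while `restored` has only the two original ones.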

#### `Primary Overview & Analysis`

<div class="alert alert-info"><p id="4"></p>

**[df.sample()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)** - returns a random sample of rows from the dataframe. The `n` argument sets the number of rows returned.

In [4]:
rides_info.sample(5)

Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
505938,P17367280Z,j18905567d,c1x,2020-03-04,4.68,35,379,39,61.0,1,1028.238746,0,18.97434,-3.9
411034,S41712899i,d-1766392g,F1k,2020-01-25,6.26,32,411,29,55.0,0,597.545192,0,9.485338,-0.0
303641,C75049554a,V-9683916q,Z1i,2020-01-06,1.45,21,164,63,79.888165,0,1378.883247,0,5.04065,-0.373
449005,V16037607k,f19927901y,f1v,2020-02-14,5.3,26,255,53,78.0,0,1268.817667,0,-1.299351,13.363
490598,O70169471V,i14933910A,i1f,2020-02-18,3.36,27,265,42,73.557628,0,1186.051319,0,-2.368698,23.26


<div class="alert alert-info">

**[df.head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)** - returns the first rows of the dataframe. By default it returns 5 rows; this can be changed with the `n` argument.

<div class="alert alert-info">

**head()** is used more often, but for a first look at a dataframe **sample()** is the better choice: it shows rows from different parts of the data, which makes anomalies easier to spot.

In [5]:
rides_info.head()

Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
0,o52317055h,A-1049127W,b1v,2020-01-01,4.95,21,268,36,113.548538,0,514.24692,0,1.11526,2.909
1,H41298704y,A-1049127W,T1U,2020-01-01,6.91,8,59,36,93.0,1,197.520662,0,1.650465,4.133
2,v88009926E,A-1049127W,g1p,2020-01-02,6.01,20,315,61,81.959675,0,1276.328206,0,2.599112,2.461
3,t14229455i,A-1049127W,S1c,2020-01-02,0.26,19,205,32,128.0,0,535.680831,0,3.216255,0.909
4,W17067612E,A-1049127W,X1b,2020-01-03,1.21,56,554,38,90.0,1,1729.143367,0,2.71655,-1.822


<div class="alert alert-success">
    <b>Checkpoint Conclusions</b>
    <ul>
    <li>The data loaded successfully</li>
    <li>The column names were read from the first row</li>
    <li>The index values look reasonable</li>
    <li>No saved index column that would need to be deleted was found</li>
    <li>Five random rows showed no obvious errors</li></ul>
</div>

<div class="alert alert-info">

**[df.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)** - displays a brief summary of the dataframe.

In this specific case the dataframe is not too big, but real ones can be. We will go through the `Pandas` methods on this small example; you can then apply them to other datasets in the same way.

In [6]:
rides_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 739500 entries, 0 to 739499
Data columns (total 14 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   user_id            739500 non-null  object 
 1   car_id             739500 non-null  object 
 2   ride_id            739500 non-null  object 
 3   ride_date          739500 non-null  object 
 4   rating             739500 non-null  float64
 5   ride_duration      739500 non-null  int64  
 6   ride_cost          739500 non-null  int64  
 7   speed_avg          739500 non-null  int64  
 8   speed_max          736139 non-null  float64
 9   stop_times         739500 non-null  int64  
 10  distance           739500 non-null  float64
 11  refueling          739500 non-null  int64  
 12  user_ride_quality  736872 non-null  float64
 13  deviation_normal   739500 non-null  float64
dtypes: float64(5), int64(5), object(4)
memory usage: 79.0+ MB


<div class="alert alert-success"><b>Checkpoint Conclusions</b>

* The dataframe has 14 columns
* Four columns are of the `object` type
* Five columns are of the `float64` type
* Five columns are of the `int64` type
* There are 739,500 rows
* There are missing values (NaNs) in the `speed_max` and `user_ride_quality` columns
* The dataframe occupies `79.0+ MB` of memory (it is useful to know this in advance to judge which operations are feasible given the available RAM)
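The `+` in the memory figure means the number is a lower bound: for `object` columns, pandas counts only the references by default, not the strings they point to. A minimal sketch of the difference, on a made-up `toy` frame (an assumption for illustration; string columns are stored as `object` in the pandas version this notebook appears to use):

```python
import pandas as pd

toy = pd.DataFrame({
    "code": ["a1", "b2", "c3", "d4"],   # string column
    "value": [1, 2, 3, 4],              # int64 column
})

# Shallow estimate: for object columns only the references are counted.
shallow_bytes = toy.memory_usage().sum()

# Deep estimate: also counts the string objects themselves.
deep_bytes = toy.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes)
```

The same deep measurement is available in the summary itself via `toy.info(memory_usage="deep")`, at the cost of a slower call on large frames.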

<div class="alert alert-info">

**[df.columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html)** - displays the column labels of the DataFrame.

When a dataframe has very many columns, the **df.info()** output may not list them all. In that case, use this attribute to view the column names.

In [7]:
rides_info.columns

Index(['user_id', 'car_id', 'ride_id', 'ride_date', 'rating', 'ride_duration',
       'ride_cost', 'speed_avg', 'speed_max', 'stop_times', 'distance',
       'refueling', 'user_ride_quality', 'deviation_normal'],
      dtype='object')

A plain list of column names is often clearer and more convenient to work with:

In [8]:
column_names = list(rides_info.columns)  # same result as [feat for feat in rides_info.columns]
column_names, len(column_names)

(['user_id',
  'car_id',
  'ride_id',
  'ride_date',
  'rating',
  'ride_duration',
  'ride_cost',
  'speed_avg',
  'speed_max',
  'stop_times',
  'distance',
  'refueling',
  'user_ride_quality',
  'deviation_normal'],
 14)

<div class="alert alert-success">

<b>Checkpoint Conclusions</b>
<ul>

<li>14 columns</li>

<li>

All column names are lowercase and contain no special characters, so they can be accessed with dot notation (`df.column_name`) without renaming </li>

</ul>

</div>

##### Working with Numerical Data

<div class="alert alert-info">
<p id="5"></p>

**[df.describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)** - displays descriptive statistics for the numeric columns of the dataframe. Descriptive statistics include:
- `count` - the number of non-missing values
- `mean` - arithmetic mean
- `std` - standard deviation
- `min`, `max` - minimum and maximum
- `25%`, `50%`, `75%` - the corresponding percentiles (quartiles)

`NaN` values are excluded from the statistics automatically.
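To make the statistics listed above concrete, here is a small sketch on a hand-made `Series` (the values are an assumption chosen so the results are easy to verify by hand). It shows that the missing value is skipped and that the `50%` row is simply the median.

```python
import numpy as np
import pandas as pd

# Four real values and one missing value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], name="x")

stats = s.describe()

# count ignores the NaN, so it is 4, not 5.
n_valid = stats["count"]

# mean is computed over the four non-missing values only.
mean_value = stats["mean"]

# The 50% row equals the median; quantile() gives the same number.
median_value = stats["50%"]
same_as_quantile = median_value == s.quantile(0.5)
```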

In [9]:
rides_info.describe().round(2).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rating,739500.0,4.47,2.13,0.0,3.12,4.47,5.83,10.0
ride_duration,739500.0,1669.62,6356.64,2.0,27.0,44.0,69.0,43956.0
ride_cost,739500.0,20931.08,87315.37,7.0,298.0,505.0,888.0,2007346.0
speed_avg,739500.0,47.01,12.69,25.0,38.0,46.0,52.0,100.0
speed_max,736139.0,83.79,29.64,27.9,64.0,75.28,97.0,209.98
stop_times,739500.0,1.34,2.37,0.0,0.0,1.0,2.0,23.0
distance,739500.0,78395.67,315814.74,1.84,792.6,1452.54,2247.8,3606050.58
refueling,739500.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
user_ride_quality,736872.0,-0.27,11.66,-65.78,-6.98,0.07,6.46,69.57
deviation_normal,739500.0,-1.34,19.58,-126.88,-9.36,0.0,7.54,98.74


<div class="alert alert-success">
    <b>Checkpoint Conclusions</b>

- Column descriptions: `speed_avg` is the average speed on the taxi route, `rating` is the rating the driver received for the ride, `ride_duration` is the ride duration, `ride_cost` is the ride cost, and `stop_times` is the number of stops
- The average speed across all rides is about `47 km/h`, and the highest average speed is `100 km/h`
- At least half of the rides had one or more stops (the median of `stop_times` is 1)
- The maximum rating is `10 points`, the minimum is `0 points`. Half of all ratings fall between about `3.1` and `5.8 points` (the 25% and 75% quartiles)
- The units of ride duration and cost cannot be determined unambiguously: the values have probably been modified, so this requires further analysis and visualization

</div>

<div class="alert alert-info">

**describe()** can also be used for categorical features. All we need to do is pass the `include` argument with the value `'object'`.

</div>

In [10]:
rides_info.describe(include='object').T

Unnamed: 0,count,unique,top,freq
user_id,739500,15153,n50223955s,153
car_id,739500,4250,A-1049127W,174
ride_id,739500,2704,k1y,330
ride_date,739500,93,2020-01-01,8039


<div class="alert alert-info">

**[df.value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)** - returns the unique values and the number of their occurrences as a `Series`. By default the result is sorted in descending order of counts (`ascending=False`); pass `ascending=True` to reverse it. Numeric columns can also be grouped into bins via the `bins` parameter.

</div>

In [11]:
rides_info.ride_id.value_counts()[:5]

ride_id
k1y    330
V1C    329
l1j    326
Z1j    325
O1e    324
Name: count, dtype: int64

<div class="alert alert-success">

**Checkpoint Conclusion:** the most common route code is `k1y`, with 330 rides

</div>

<div class="alert alert-warning">

Let's see how the rides are distributed across rating ranges by splitting the ratings into bins:

</div>

In [12]:
rides_info.rating.value_counts(bins=4)

(2.5, 5.0]       321030
(5.0, 7.5]       223169
(-0.011, 2.5]    131475
(7.5, 10.0]       63826
Name: count, dtype: int64

<div class="alert alert-success">

**Checkpoint Conclusion:** 26.4% of all rides have extreme ratings: at most 2.5 or above 7.5

</div>
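A percentage like the one in the conclusion above does not have to be computed by hand: `value_counts()` accepts `normalize=True` and then returns shares instead of counts. The sketch below assumes a small made-up `ratings` series rather than the real ride data.

```python
import pandas as pd

# Hypothetical ratings on a 0-10 scale, made up for illustration.
ratings = pd.Series([0.5, 1.0, 3.0, 4.0, 4.5, 6.0, 7.0, 8.0, 9.5, 10.0])

# normalize=True returns shares; sort_index() puts the bins in ascending order.
shares = ratings.value_counts(bins=4, normalize=True).sort_index()

# Combined share of the lowest and the highest bin.
extreme_share = shares.iloc[0] + shares.iloc[-1]

print(shares)
print(f"{extreme_share:.1%}")
```

Sorting by index before picking the first and last entries matters, because by default the result is ordered by frequency, not by bin position.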

<div class="alert alert-info">

**[duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)** - a method that identifies duplicates in the data. It returns a Boolean `Series` marking duplicate rows. The `subset` argument restricts the comparison to specific columns.

<div class="alert alert-warning">

**Example:**
</div>

In [13]:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
display(df.duplicated())

# duplicated() can also be restricted to a specific column or columns
df.duplicated(subset=['brand'])

0    False
1     True
2    False
3    False
4    False
dtype: bool

0    False
1     True
2    False
3     True
4     True
dtype: bool
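The Boolean result can be used directly as a filter. The sketch below rebuilds the same noodle example under a separate name (`df_dup`, an assumption to avoid clashing with `df`) and shows two common uses: listing every copy of a duplicated row with `keep=False`, and keeping only first occurrences.

```python
import pandas as pd

df_dup = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# keep=False marks every member of a duplicate group, not only the repeats.
all_copies = df_dup[df_dup.duplicated(keep=False)]

# Negating the default mask keeps the first occurrence of each row,
# which matches what drop_duplicates() returns by default.
first_only = df_dup[~df_dup.duplicated()]
```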

### <center> 📤 Getting Useful Information from Data </center>

<p id="3"></p>

5 sections:
- Sorting Data;
- Filtering;
- Dealing with missing records;
- Index data;
- Functions;

#### `Sorting DataFrames`
<div class="alert alert-info">
    
`DataFrame`s can be sorted. Let's see an example:

In [15]:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# Create a DataFrame from the dictionary, using the labels as the index
df = pd.DataFrame(data, index=labels)
df

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


**By values**

In [16]:
df.sort_values(by='age') 

Unnamed: 0,animal,age,visits,priority
c,snake,0.5,2,no
f,cat,2.0,3,no
a,cat,2.5,1,yes
b,cat,3.0,3,yes
j,dog,3.0,1,no
g,snake,4.5,1,no
e,dog,5.0,2,no
i,dog,7.0,2,no
d,dog,,3,yes
h,cat,,1,yes


**By index** (here `axis=1` sorts the column labels, in descending order)

In [17]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,visits,priority,animal,age
a,1,yes,cat,2.5
b,3,yes,cat,3.0
c,2,no,snake,0.5
d,3,yes,dog,
e,2,no,dog,5.0
f,3,no,cat,2.0
g,1,no,snake,4.5
h,1,yes,cat,
i,2,no,dog,7.0
j,1,no,dog,3.0


<div class="alert alert-info">
    
By default, the `ascending` parameter is `True`. 
To sort by two or more columns, pass lists to `by` (the column names) and `ascending` (one Boolean per column).
The `inplace` parameter is `False` by default as well.
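A minimal sketch of sorting by two columns with different directions. The `pets` frame is a shortened, made-up variant of the example above, introduced only for illustration.

```python
import numpy as np
import pandas as pd

pets = pd.DataFrame(
    {"animal": ["cat", "dog", "cat", "dog", "snake"],
     "age": [2.5, np.nan, 3.0, 5.0, 0.5]},
    index=list("abcde"),
)

# animal in ascending order (A to Z); within each animal, age in descending order.
ordered = pets.sort_values(by=["animal", "age"], ascending=[True, False])

print(ordered)
```

Because `inplace` is left at its default, `pets` keeps its original row order and the sorted result lives in `ordered`. Missing ages are placed last within their group, which is the default `na_position='last'` behaviour.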

#### `Filtering DataFrames`

To filter a DataFrame we can use logical expressions. Such an expression produces a Boolean mask: a `Series` with the same index as the DataFrame and a `True`/`False` value for each row.

In [19]:
df.visits>2

a    False
b     True
c    False
d     True
e    False
f     True
g    False
h    False
i    False
j    False
Name: visits, dtype: bool

In [20]:
df[df.visits>2]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes
f,cat,2.0,3,no


To combine several conditions, wrap each condition in parentheses `()`:
- `&` means logical "and"
- `|` means logical "or"

In [21]:
df.loc[(df.visits > 2) & (df.priority == 'yes')]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes


In [22]:
df.loc[(df.visits > 2) | (df.age > 5)]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes
f,cat,2.0,3,no
i,dog,7.0,2,no


To output the part of the DataFrame with cheap rides (ride cost below 500), we store this condition in a variable and use it to select the matching rows:

In [23]:
rule_cheap = rides_info.ride_cost < 500
rides_info[rule_cheap].head().round(2)

Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
0,o52317055h,A-1049127W,b1v,2020-01-01,4.95,21,268,36,113.55,0,514.25,0,1.12,2.91
1,H41298704y,A-1049127W,T1U,2020-01-01,6.91,8,59,36,93.0,1,197.52,0,1.65,4.13
2,v88009926E,A-1049127W,g1p,2020-01-02,6.01,20,315,61,81.96,0,1276.33,0,2.6,2.46
3,t14229455i,A-1049127W,S1c,2020-01-02,0.26,19,205,32,128.0,0,535.68,0,3.22,0.91
6,W11562554A,A-1049127W,A1g,2020-01-04,5.78,30,324,48,61.0,0,1314.26,0,1.46,-6.0
