<div style=" border-bottom: 8px solid #e3f56c; overflow: hidden; border-radius: 10px; height: 60px; width: 100%; display: flex;">
  <div style="height: 100%; width: 100%; background-color: #3800BB; float: left; text-align: center; display: flex; justify-content: left; align-items: center; font-size: 40px; ">
    <b><span style="color: #FFFFFF; padding: 20px 20px;">Automatic Feature Generation</span></b>
  </div>
</div>

<div class="alert" style="background-color: #FEDAD5; border-left: 8px solid #B12111; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">
  <h5 style="font-size: 16px; margin-bottom: 10px;">
    <strong> Contents </strong>
  </h5>
<hr>
  <p><font size="3" face="Arial" font-size="large">
  <ul type="square">

  <li> Featuretools – for data in the form of a SQL database.  </li>
  <li> GeoPandas – for working with geospatial data.  </li>
  <li> Karateclub and NetworkX – for graphs.  </li>
  <li> Tsfresh – for time series.  </li>
  <li> Conclusions and summary.  </li>
  
  </ul>
  </font></p>

</div>

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

* Once every meaningful visual dependency has been extracted from the raw data and no further insights are forthcoming, the next logical step is to examine combinations of existing features, such as products, sums, averages, or frequencies of categorical variables.  
* Although this process may appear labor-intensive, it can be automated with specialized tools.
* This notebook focuses on libraries designed for automatic feature generation.

</div>
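
As a point of comparison for the automated tools, the feature combinations listed above can be built by hand in pandas. The sketch below is only an illustration on made-up toy data: the values are invented, and the new feature names (`speed_avg_x_duration`, `car_type_freq`, and so on) are hypothetical.

```python
import pandas as pd

# Toy data: values are invented purely for illustration.
df = pd.DataFrame({
    "car_type": ["economy", "economy", "business", "economy"],
    "speed_avg": [40.0, 60.0, 30.0, 50.0],
    "speed_max": [90.0, 120.0, 80.0, 100.0],
    "ride_duration": [2.0, 1.0, 4.0, 3.0],
})

features = df.copy()

# Product of two numeric features.
features["speed_avg_x_duration"] = df["speed_avg"] * df["ride_duration"]

# Sum of two numeric features.
features["speed_avg_plus_max"] = df["speed_avg"] + df["speed_max"]

# Average of a numeric feature within each category, broadcast back to the rows.
features["speed_avg_mean_by_type"] = (
    df.groupby("car_type")["speed_avg"].transform("mean")
)

# Relative frequency of each category value.
features["car_type_freq"] = df["car_type"].map(
    df["car_type"].value_counts(normalize=True)
)

print(features)
```

Each new column takes one line here, but the number of possible combinations grows quickly with the number of columns, which is the motivation for automating the search.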

<div class="alert" style="background-color:rgb(0, 0, 0); border-left: 8px solid #B12111; padding: 14px; border-radius: 8px; font-size: 14px; color:rgb(255, 255, 255);">

!pip install featuretools -q

</div>

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

`Featuretools` is designed for data in the form of a SQL-style database — that is, multiple tables linked by ID fields.

</div>
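
To make "multiple tables linked by ID fields" concrete, the sketch below does by hand, in plain pandas, the kind of per-ID aggregation that a multi-table tool is meant to automate: child rows are summarized per parent key and joined back to the parent table. It does not use the Featuretools API. The toy tables `cars` and `rides` and their values are assumptions made for illustration.

```python
import pandas as pd

# Toy parent table: one row per car (values are made up).
cars = pd.DataFrame({
    "car_id": ["a", "b", "c"],
    "car_type": ["economy", "business", "economy"],
})

# Toy child table: many rides per car, linked through car_id.
rides = pd.DataFrame({
    "ride_id": [1, 2, 3, 4],
    "car_id": ["a", "a", "b", "a"],
    "ride_cost": [100, 200, 300, 600],
})

# Summarize the child rows for every parent key.
ride_stats = (
    rides.groupby("car_id")["ride_cost"]
    .agg(["count", "mean", "max"])
    .add_prefix("ride_cost_")
    .reset_index()
)

# A left join keeps cars that have no rides; their aggregates are NaN.
car_features = cars.merge(ride_stats, on="car_id", how="left")

# A missing count means "no rides", so it can safely become 0.
car_features["ride_cost_count"] = (
    car_features["ride_cost_count"].fillna(0).astype(int)
)

print(car_features)
```

The left join is a deliberate choice: an inner join would silently drop parent rows that have no children, which is rarely wanted when the parent table defines the modelling unit.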

<img src='../imgs/05.1.01_1.png' width='600px'>

In [6]:
import pandas as pd
import featuretools as ft
from classes import Paths

In [7]:
paths = Paths()

In [8]:
# Paths is a project-local helper; these attributes hold the CSV file locations
path_car_info = paths.car_train
path_rides_info = paths.rides_info
path_driver_info = paths.driver_info
path_fix_info = paths.fix_info

In [9]:
# Load the four tables: cars, rides, drivers, and repair records
car_info = pd.read_csv(path_car_info)
rides_info = pd.read_csv(path_rides_info)
driver_info = pd.read_csv(path_driver_info)
fix_info = pd.read_csv(path_fix_info)

In [10]:
# Show each table's shape and preview its first 10 rows
print('car_info', car_info.shape)
display(car_info.head(10))
print('rides_info', rides_info.shape)
display(rides_info.head(10))
print('driver_info', driver_info.shape)
display(driver_info.head(10))
print('fix_info', fix_info.shape)
display(fix_info.head(10))

car_info (2337, 10)


Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class
0,y13744087j,Kia Rio X-line,economy,petrol,3.78,2015,76163,2021,108.53,another_bug
1,O41613818T,VW Polo VI,economy,petrol,3.9,2015,78218,2021,35.2,electro_bug
2,d-2109686j,Renault Sandero,standart,petrol,6.3,2012,23340,2017,38.62,gear_stick
3,u29695600e,Mercedes-Benz GLC,business,petrol,4.04,2011,1263,2020,30.34,engine_fuel
4,N-8915870N,Renault Sandero,standart,petrol,4.7,2012,26428,2017,30.45,engine_fuel
5,b12101843B,Skoda Rapid,economy,petrol,2.36,2013,42176,2018,50.93,engine_ignition
6,Q-9368117S,Nissan Qashqai,standart,petrol,5.32,2012,24611,2014,54.79,engine_overheat
7,O-2124190y,Tesla Model 3,premium,electro,3.9,2017,116872,2019,50.26,gear_stick
8,h16895544p,Kia Sportage,standart,petrol,3.5,2014,56384,2017,33.24,gear_stick
9,K77009462l,Smart ForFour,economy,petrol,4.56,2013,41309,2018,39.43,gear_stick


rides_info (739500, 14)


Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
0,o52317055h,A-1049127W,b1v,2020-01-01,4.95,21,268,36,113.548538,0,514.24692,0,1.11526,2.909
1,H41298704y,A-1049127W,T1U,2020-01-01,6.91,8,59,36,93.0,1,197.520662,0,1.650465,4.133
2,v88009926E,A-1049127W,g1p,2020-01-02,6.01,20,315,61,81.959675,0,1276.328206,0,2.599112,2.461
3,t14229455i,A-1049127W,S1c,2020-01-02,0.26,19,205,32,128.0,0,535.680831,0,3.216255,0.909
4,W17067612E,A-1049127W,X1b,2020-01-03,1.21,56,554,38,90.0,1,1729.143367,0,2.71655,-1.822
5,I45176130J,A-1049127W,j1v,2020-01-03,7.52,67,1068,28,36.0,2,363.209144,0,0.496265,-3.442
6,W11562554A,A-1049127W,A1g,2020-01-04,5.78,30,324,48,61.0,0,1314.257355,0,1.464346,-6.004
7,o13713369s,A-1049127W,B1n,2020-01-04,7.35,29,401,57,65.845512,0,1753.88842,0,0.497193,-6.474
8,y62286141d,A-1049127W,h1a,2020-01-05,0.12,64,893,38,114.0,1,2022.125012,0,-0.155147,-5.123
9,V28486769l,A-1049127W,p1e,2020-01-05,3.32,43,424,31,51.298365,1,1334.567248,0,-3.757628,-2.079


driver_info (15153, 7)


Unnamed: 0,age,user_rating,user_rides,user_time_accident,user_id,sex,first_ride_date
0,27,9.0,865,19.0,l17437965W,1,2019-4-2
1,46,7.9,2116,11.0,Z12362316j,0,2021-11-19
2,59,7.8,947,4.0,g11098715c,0,2021-1-15
3,37,7.0,18,4.0,U12618125q,0,2019-11-20
4,39,8.2,428,7.0,A14375829B,0,2019-7-23
5,21,9.9,831,22.0,L95976611S,1,2020-9-18
6,39,6.9,2293,5.0,z74338505G,0,2022-3-30
7,26,7.9,142,5.0,q11106749z,1,2019-12-22
8,18,9.3,425,18.0,r77865210A,1,2020-6-4
9,23,9.2,601,12.0,t10928335r,1,2020-7-18


fix_info (146000, 6)


Unnamed: 0,car_id,worker_id,fix_date,work_type,destroy_degree,work_duration
0,P17494612l,RJ,2020-6-20 2:14,reparking,8.0,49
1,N-1530212S,LM,2020-2-9 20:25,repair,10.0,48
2,B-1154399t,ND,2019-8-24 7:1,reparking,1.0,27
3,y13744087j,PG,2019-8-10 9:29,reparking,1.0,28
4,F12725233R,YC,2020-11-12 5:22,refuel_check,8.0,47
5,O41613818T,RW,2019-2-21 13:25,reparking,1.0,32
6,l-1139189J,PO,2020-3-2 19:11,reparking,1.0,28
7,d-2109686j,ML,2018-3-2 5:12,repair,7.4,39
8,u29695600e,QN,2020-2-2 20:10,reparking,10.0,64
9,U75286923j,KC,2019-9-2 6:32,reparking,1.0,24
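
The previews show `car_id` in `car_info`, `rides_info`, and `fix_info`, and `user_id` in both `rides_info` and `driver_info`, so these look like the linking fields. Before relating the tables, it may be worth checking that each key is unique in its parent table and that every child row points to an existing parent. The sketch below runs such a check on toy data. The helper `check_link` is hypothetical and not part of any library.

```python
import pandas as pd


def check_link(parent, child, key):
    """Return (is the key unique in parent, number of child rows with no parent)."""
    parent_unique = bool(parent[key].is_unique)
    orphans = int((~child[key].isin(parent[key])).sum())
    return parent_unique, orphans


# Toy tables: the ride with car_id "z" has no matching car.
cars = pd.DataFrame({"car_id": ["a", "b", "c"]})
rides = pd.DataFrame({"ride_id": [1, 2, 3], "car_id": ["a", "a", "z"]})

result = check_link(cars, rides, "car_id")
print(result)
```

On the real data the same call could be applied to each parent and child pair, for example `check_link(car_info, rides_info, "car_id")`. A non-unique parent key or a non-zero orphan count would need to be resolved before the tables are linked.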
