<div style=" border-bottom: 8px solid #e3f56c; overflow: hidden; border-radius: 10px; height: 60px; width: 100%; display: flex;">
  <div style="height: 100%; width: 100%; background-color: #3800BB; float: left; text-align: center; display: flex; justify-content: left; align-items: center; font-size: 40px; ">
    <b><span style="color: #FFFFFF; padding: 20px 20px;">Automatic Feature Generation</span></b>
  </div>
</div>

<div class="alert" style="background-color: #FEDAD5; border-left: 8px solid #B12111; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">
  <h5 style="font-size: 16px; margin-bottom: 10px;">
    <strong> Contents </strong>
  </h5>
<hr>
  <p><font size="3" face="Arial" font-size="large">
  <ul type="square">

  <li> Featuretools – for data in the form of a SQL database.  </li>
  <li> GeoPandas – for working with geospatial data.  </li>
  <li> Karateclub and NetworkX – for graphs.  </li>
  <li> Tsfresh – for time series.  </li>
  <li> Conclusions and summary.  </li>
  
  </ul>
  </font></p>

</div>

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

* Once visual exploration of the raw data stops yielding useful insights, the next logical step is to examine combinations of existing features, such as products, sums, averages, or frequencies of categorical variables.  
* Although this process may appear labor-intensive, much of it can be automated with specialized tools.
* This notebook focuses on libraries designed for automatic feature generation.

</div>
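To make the idea of "combinations of existing features" concrete, here is a minimal hand-written sketch in plain pandas. The toy data and the new column names are made up for illustration; the libraries covered in this notebook generate features of this kind automatically.

```python
import pandas as pd

# Toy data; values and new column names are illustrative only.
df = pd.DataFrame({
    "car_type": ["economy", "economy", "business", "economy"],
    "speed_avg": [36.0, 61.0, 32.0, 38.0],
    "ride_duration": [21, 20, 19, 56],
})

# Product of two numeric features.
df["speed_x_duration"] = df["speed_avg"] * df["ride_duration"]

# Group average of a numeric feature, broadcast back to every row.
df["speed_avg_by_type"] = df.groupby("car_type")["speed_avg"].transform("mean")

# Relative frequency of each category value.
df["car_type_freq"] = df["car_type"].map(df["car_type"].value_counts(normalize=True))

print(df)
```

Writing each of these by hand scales poorly once several related tables are involved, which is the situation the tools below are meant to handle.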

<div class="alert alert-warning">

### **FeatureTools**

</div>

<div class="alert" style="background-color:rgb(0, 0, 0); border-left: 8px solid #B12111; padding: 14px; border-radius: 8px; font-size: 14px; color:rgb(255, 255, 255);">

!pip install featuretools -q

</div>

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

`Featuretools` is designed for data in the form of a SQL-style database — that is, multiple tables linked by ID fields.

</div>

<img src='../imgs/05.1.01_1.png' width='600px'>

In [1]:
import pandas as pd
import featuretools as ft
from classes import Paths

In [3]:
paths = Paths()

path_car_info = paths.car_train
path_rides_info = paths.rides_info
path_driver_info = paths.driver_info
path_fix_info = paths.fix_info

In [4]:
car_info = pd.read_csv(path_car_info)
rides_info = pd.read_csv(path_rides_info)
driver_info = pd.read_csv(path_driver_info)
fix_info = pd.read_csv(path_fix_info)

In [5]:
print('car_info', car_info.shape)
display(car_info.head(10))
print('rides_info', rides_info.shape)
display(rides_info.head(10))
print('driver_info', driver_info.shape)
display(driver_info.head(10))
print('fix_info', fix_info.shape)
display(fix_info.head(10))

car_info (2337, 10)


Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class
0,y13744087j,Kia Rio X-line,economy,petrol,3.78,2015,76163,2021,108.53,another_bug
1,O41613818T,VW Polo VI,economy,petrol,3.9,2015,78218,2021,35.2,electro_bug
2,d-2109686j,Renault Sandero,standart,petrol,6.3,2012,23340,2017,38.62,gear_stick
3,u29695600e,Mercedes-Benz GLC,business,petrol,4.04,2011,1263,2020,30.34,engine_fuel
4,N-8915870N,Renault Sandero,standart,petrol,4.7,2012,26428,2017,30.45,engine_fuel
5,b12101843B,Skoda Rapid,economy,petrol,2.36,2013,42176,2018,50.93,engine_ignition
6,Q-9368117S,Nissan Qashqai,standart,petrol,5.32,2012,24611,2014,54.79,engine_overheat
7,O-2124190y,Tesla Model 3,premium,electro,3.9,2017,116872,2019,50.26,gear_stick
8,h16895544p,Kia Sportage,standart,petrol,3.5,2014,56384,2017,33.24,gear_stick
9,K77009462l,Smart ForFour,economy,petrol,4.56,2013,41309,2018,39.43,gear_stick


rides_info (739500, 14)


Unnamed: 0,user_id,car_id,ride_id,ride_date,rating,ride_duration,ride_cost,speed_avg,speed_max,stop_times,distance,refueling,user_ride_quality,deviation_normal
0,o52317055h,A-1049127W,b1v,2020-01-01,4.95,21,268,36,113.548538,0,514.24692,0,1.11526,2.909
1,H41298704y,A-1049127W,T1U,2020-01-01,6.91,8,59,36,93.0,1,197.520662,0,1.650465,4.133
2,v88009926E,A-1049127W,g1p,2020-01-02,6.01,20,315,61,81.959675,0,1276.328206,0,2.599112,2.461
3,t14229455i,A-1049127W,S1c,2020-01-02,0.26,19,205,32,128.0,0,535.680831,0,3.216255,0.909
4,W17067612E,A-1049127W,X1b,2020-01-03,1.21,56,554,38,90.0,1,1729.143367,0,2.71655,-1.822
5,I45176130J,A-1049127W,j1v,2020-01-03,7.52,67,1068,28,36.0,2,363.209144,0,0.496265,-3.442
6,W11562554A,A-1049127W,A1g,2020-01-04,5.78,30,324,48,61.0,0,1314.257355,0,1.464346,-6.004
7,o13713369s,A-1049127W,B1n,2020-01-04,7.35,29,401,57,65.845512,0,1753.88842,0,0.497193,-6.474
8,y62286141d,A-1049127W,h1a,2020-01-05,0.12,64,893,38,114.0,1,2022.125012,0,-0.155147,-5.123
9,V28486769l,A-1049127W,p1e,2020-01-05,3.32,43,424,31,51.298365,1,1334.567248,0,-3.757628,-2.079


driver_info (15153, 7)


Unnamed: 0,age,user_rating,user_rides,user_time_accident,user_id,sex,first_ride_date
0,27,9.0,865,19.0,l17437965W,1,2019-4-2
1,46,7.9,2116,11.0,Z12362316j,0,2021-11-19
2,59,7.8,947,4.0,g11098715c,0,2021-1-15
3,37,7.0,18,4.0,U12618125q,0,2019-11-20
4,39,8.2,428,7.0,A14375829B,0,2019-7-23
5,21,9.9,831,22.0,L95976611S,1,2020-9-18
6,39,6.9,2293,5.0,z74338505G,0,2022-3-30
7,26,7.9,142,5.0,q11106749z,1,2019-12-22
8,18,9.3,425,18.0,r77865210A,1,2020-6-4
9,23,9.2,601,12.0,t10928335r,1,2020-7-18


fix_info (146000, 6)


Unnamed: 0,car_id,worker_id,fix_date,work_type,destroy_degree,work_duration
0,P17494612l,RJ,2020-6-20 2:14,reparking,8.0,49
1,N-1530212S,LM,2020-2-9 20:25,repair,10.0,48
2,B-1154399t,ND,2019-8-24 7:1,reparking,1.0,27
3,y13744087j,PG,2019-8-10 9:29,reparking,1.0,28
4,F12725233R,YC,2020-11-12 5:22,refuel_check,8.0,47
5,O41613818T,RW,2019-2-21 13:25,reparking,1.0,32
6,l-1139189J,PO,2020-3-2 19:11,reparking,1.0,28
7,d-2109686j,ML,2018-3-2 5:12,repair,7.4,39
8,u29695600e,QN,2020-2-2 20:10,reparking,10.0,64
9,U75286923j,KC,2019-9-2 6:32,reparking,1.0,24


<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

To begin, we need to create an `EntitySet`, which will contain our tables and the relationships between them.
</div>

In [6]:
es = ft.EntitySet(id="car_rides")
es

Entityset: car_rides
  DataFrames:
  Relationships:
    No relationships

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

Next, we will add tables to the `EntitySet`.

Some columns in these tables, such as `model`, `fuel_type`, `car_type`, and `year_to_work`, are categorical. We typically do not want to apply operations like summation or averaging to them.  
Therefore, we will explicitly define data types for `Featuretools` using the `woodwork` library, which is installed along with `Featuretools`.

</div>

In [7]:
# List the logical types that are available
ft.list_logical_types()

Unnamed: 0,name,type_string,description,physical_type,standard_tags,is_default_type,is_registered,parent_type
0,Address,address,Represents Logical Types that contain address ...,string,{},True,True,
1,Age,age,Represents Logical Types that contain whole nu...,int64,{numeric},True,True,Integer
2,AgeFractional,age_fractional,Represents Logical Types that contain non-nega...,float64,{numeric},True,True,Double
3,AgeNullable,age_nullable,Represents Logical Types that contain whole nu...,Int64,{numeric},True,True,IntegerNullable
4,Boolean,boolean,Represents Logical Types that contain binary v...,bool,{},True,True,BooleanNullable
5,BooleanNullable,boolean_nullable,Represents Logical Types that contain binary v...,boolean,{},True,True,
6,Categorical,categorical,Represents Logical Types that contain unordere...,category,{category},True,True,
7,CountryCode,country_code,Represents Logical Types that use the ISO-3166...,category,{category},True,True,Categorical
8,CurrencyCode,currency_code,Represents Logical Types that use the ISO-4217...,category,{category},True,True,Categorical
9,Datetime,datetime,Represents Logical Types that contain date and...,datetime64[ns],{},True,True,


<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

We add our tables to the `EntitySet` using the `add_dataframe` method, specifying logical types for the non-numeric columns.
</div>

In [8]:
from woodwork.logical_types import Age, Categorical, Datetime

# Parent table: one row per car, keyed by car_id.
es = es.add_dataframe(
    dataframe_name="cars",
    dataframe=car_info,
    index="car_id",
    logical_types={"car_type": Categorical, "fuel_type": Categorical, "model": Categorical},
)

# rides has no column named "index", so make_index=True asks Featuretools
# to create an integer index column with that name.
es = es.add_dataframe(
    dataframe_name="rides",
    dataframe=rides_info.drop(["ride_id"], axis=1),
    index="index",
    make_index=True,
    time_index="ride_date",
)

# Parent table: one row per driver, keyed by user_id.
es = es.add_dataframe(
    dataframe_name="drivers",
    dataframe=driver_info,
    index="user_id",
    logical_types={"sex": Categorical, "first_ride_date": Datetime, "age": Age},
)

# fixes also gets a generated integer index.
es = es.add_dataframe(
    dataframe_name="fixes",
    dataframe=fix_info,
    index="index",
    make_index=True,
    logical_types={"work_type": Categorical, "worker_id": Categorical},
)
es

Entityset: car_rides
  DataFrames:
    cars [Rows: 2337, Columns: 10]
    rides [Rows: 739500, Columns: 14]
    drivers [Rows: 15153, Columns: 7]
    fixes [Rows: 146000, Columns: 7]
  Relationships:
    No relationships

<div class="alert" style="background-color: #FEF9E7; border-left: 8px solid #D4AC0D; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

Next, we add the relationships between the dataframes: each ride references a car and a driver, and each fix references a car.
</div>

In [9]:
es = es.add_relationship("cars", "car_id", "rides", "car_id")
es = es.add_relationship("drivers", "user_id", "rides", "user_id")
es = es.add_relationship("cars", "car_id", "fixes", "car_id")

es



Entityset: car_rides
  DataFrames:
    cars [Rows: 2337, Columns: 10]
    rides [Rows: 739500, Columns: 14]
    drivers [Rows: 15153, Columns: 7]
    fixes [Rows: 146000, Columns: 7]
  Relationships:
    rides.car_id -> cars.car_id
    rides.user_id -> drivers.user_id
    fixes.car_id -> cars.car_id

<div class="alert" style="background-color: #FEF9E7; border-left: 8px solid #D4AC0D; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

Now we generate features for the `cars` dataframe with `ft.dfs` (Deep Feature Synthesis).
</div>

In [10]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="cars",
    max_depth=1,
)
feature_matrix.head()

car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,COUNT(rides),...,MODE(fixes.work_type),MODE(fixes.worker_id),NUM_UNIQUE(fixes.work_type),NUM_UNIQUE(fixes.worker_id),SKEW(fixes.destroy_degree),SKEW(fixes.work_duration),STD(fixes.destroy_degree),STD(fixes.work_duration),SUM(fixes.destroy_degree),SUM(fixes.work_duration)
y13744087j,Kia Rio X-line,economy,petrol,3.78,2015,76163,2021,108.53,another_bug,174,...,reparking,LR,4,33,0.835907,0.826462,2.732847,10.171884,106.7,933.0
O41613818T,VW Polo VI,economy,petrol,3.9,2015,78218,2021,35.2,electro_bug,174,...,reparking,YH,5,34,0.997276,-0.296841,2.707233,8.574733,102.1,873.0
d-2109686j,Renault Sandero,standart,petrol,6.3,2012,23340,2017,38.62,gear_stick,174,...,repair,AP,5,35,0.472628,0.671481,2.978077,13.040983,130.9,915.0
u29695600e,Mercedes-Benz GLC,business,petrol,4.04,2011,1263,2020,30.34,engine_fuel,174,...,repair,LM,4,34,0.492743,0.63949,3.23775,14.764994,143.0,1007.0
N-8915870N,Renault Sandero,standart,petrol,4.7,2012,26428,2017,30.45,engine_fuel,174,...,repair,CD,4,34,0.478043,1.341642,3.216758,12.659537,135.8,981.0


In [11]:
feature_matrix.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2337 entries, y13744087j to z-1337463D
Data columns (total 87 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   model                          2337 non-null   category
 1   car_type                       2337 non-null   category
 2   fuel_type                      2337 non-null   category
 3   car_rating                     2337 non-null   float64 
 4   year_to_start                  2337 non-null   int64   
 5   riders                         2337 non-null   int64   
 6   year_to_work                   2337 non-null   int64   
 7   target_reg                     2337 non-null   float64 
 8   target_class                   2337 non-null   category
 9   COUNT(rides)                   2337 non-null   Int64   
 10  MAX(rides.deviation_normal)    2337 non-null   float64 
 11  MAX(rides.distance)            2337 non-null   float64 
 12  MAX(rides.rating)       

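Depth-1 aggregation features such as `COUNT(rides)` or `MAX(rides.distance)` correspond roughly to a group-by on the child table followed by a join onto the parent table. The sketch below shows that idea in plain pandas on toy data. It is an illustration of the concept, not Featuretools' actual implementation, and the toy values are invented.

```python
import pandas as pd

# Toy stand-ins for the parent (cars) and child (rides) tables.
cars = pd.DataFrame({"car_id": ["a", "b"], "car_rating": [3.8, 4.7]})
rides = pd.DataFrame({
    "car_id": ["a", "a", "b"],
    "distance": [500.0, 1300.0, 200.0],
})

# Aggregate the child table per parent key ...
agg = rides.groupby("car_id")["distance"].agg(["count", "max", "mean"])
agg.columns = ["COUNT(rides)", "MAX(rides.distance)", "MEAN(rides.distance)"]

# ... and join the result onto the parent table.
features = cars.set_index("car_id").join(agg)
print(features)
```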
<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

We can also restrict generation to only the features we need by listing the aggregation primitives explicitly.
</div>

In [12]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="cars",
    agg_primitives=["mode", "count"], # limit number of features
    max_depth=1, # limit depth
)
feature_matrix.head()

car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,COUNT(rides),COUNT(fixes),MODE(fixes.work_type),MODE(fixes.worker_id)
y13744087j,Kia Rio X-line,economy,petrol,3.78,2015,76163,2021,108.53,another_bug,174,35,reparking,LR
O41613818T,VW Polo VI,economy,petrol,3.9,2015,78218,2021,35.2,electro_bug,174,35,reparking,YH
d-2109686j,Renault Sandero,standart,petrol,6.3,2012,23340,2017,38.62,gear_stick,174,35,repair,AP
u29695600e,Mercedes-Benz GLC,business,petrol,4.04,2011,1263,2020,30.34,engine_fuel,174,35,repair,LM
N-8915870N,Renault Sandero,standart,petrol,4.7,2012,26428,2017,30.45,engine_fuel,174,35,repair,CD


<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

The `max_depth` parameter controls the complexity of the generated features. With a depth greater than 1, features are built not only from a directly related table but also by chaining relationships: for example, `SUM(rides.drivers.age)` sums the ages of the drivers over all rides of a car.  

For debugging, computations can be limited to a few examples instead of the entire dataset by passing a list of instance IDs through the `instance_ids` parameter.
</div>

In [13]:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="cars",
    agg_primitives=["mean", "sum", "mode"],
    instance_ids=["y13744087j", "d-2109686j", "N-8915870N"],
    max_depth=2,
)
feature_matrix.head()

car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,MEAN(rides.deviation_normal),...,MEAN(rides.drivers.user_time_accident),MODE(rides.DAY(ride_date)),MODE(rides.MONTH(ride_date)),MODE(rides.WEEKDAY(ride_date)),MODE(rides.YEAR(ride_date)),MODE(rides.drivers.sex),SUM(rides.drivers.age),SUM(rides.drivers.user_rating),SUM(rides.drivers.user_rides),SUM(rides.drivers.user_time_accident)
y13744087j,Kia Rio X-line,economy,petrol,3.78,2015,76163,2021,108.53,another_bug,-0.120391,...,17.724138,1,1,2,2020,1,5831.0,1432.0,144078.0,2056.0
d-2109686j,Renault Sandero,standart,petrol,6.3,2012,23340,2017,38.62,gear_stick,-2.223954,...,9.775862,1,1,2,2020,1,5714.0,1364.7,163567.0,1701.0
N-8915870N,Renault Sandero,standart,petrol,4.7,2012,26428,2017,30.45,engine_fuel,12.455678,...,15.758333,1,1,2,2020,0,5968.0,1411.5,155944.0,1891.0

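A depth-2 feature such as `SUM(rides.drivers.age)` chains two relationships: each ride is first enriched with attributes of its driver, and the result is then aggregated per car. A rough pandas sketch of that two-step logic is shown below. The toy tables are invented, and this is only a conceptual equivalent, not how Featuretools computes the feature internally.

```python
import pandas as pd

drivers = pd.DataFrame({"user_id": ["u1", "u2", "u3"], "age": [27, 46, 59]})
rides = pd.DataFrame({
    "car_id": ["a", "a", "b"],
    "user_id": ["u1", "u2", "u3"],
})

# Step 1: follow rides -> drivers to attach driver attributes to each ride.
rides_with_driver = rides.merge(drivers, on="user_id", how="left")

# Step 2: follow rides -> cars by aggregating the enriched rides per car.
sum_driver_age = (
    rides_with_driver.groupby("car_id")["age"].sum().rename("SUM(rides.drivers.age)")
)
print(sum_driver_age)
```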

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

All available primitives (the building blocks of generated features) can be listed with `ft.list_primitives()`.
</div>

In [14]:
ft.list_primitives().head()

Unnamed: 0,name,type,description,valid_inputs,return_type
0,time_since_last_min,aggregation,Calculates the time since the minimum value oc...,"<ColumnSchema (Semantic Tags = ['numeric'])>, ...",<ColumnSchema (Logical Type = Double) (Semanti...
1,kurtosis,aggregation,Calculates the kurtosis for a list of numbers,<ColumnSchema (Logical Type = Double) (Semanti...,<ColumnSchema (Logical Type = Double) (Semanti...
2,num_false_since_last_true,aggregation,Calculates the number of 'False' values since ...,<ColumnSchema (Logical Type = Boolean)>,<ColumnSchema (Logical Type = IntegerNullable)...
3,time_since_last,aggregation,Calculates the time elapsed since the last dat...,<ColumnSchema (Logical Type = Datetime) (Seman...,<ColumnSchema (Logical Type = Double) (Semanti...
4,num_consecutive_less_mean,aggregation,Determines the length of the longest subsequen...,<ColumnSchema (Semantic Tags = ['numeric'])>,<ColumnSchema (Logical Type = IntegerNullable)...


<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

`Featuretools` includes a built-in feature selection mechanism that provides three main functions:

* `ft.selection.remove_highly_null_features()` – removes features with a high percentage of missing values  
* `ft.selection.remove_single_value_features()` – removes constant (single-value) features  
* `ft.selection.remove_highly_correlated_features()` – removes highly correlated features  

Each function takes a feature matrix (a DataFrame) as its argument and returns it without the corresponding columns.

The library offers many other useful capabilities; see the [official documentation](https://featuretools.alteryx.com/en/stable/index.html) for details.

</div>
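To show what these three selection steps do, here is a rough approximation in plain pandas and NumPy on a toy feature matrix. It is a sketch of the behavior described above, not the Featuretools implementation. The thresholds (0.7 and 0.95) and the column names are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

fm = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],
    "constant": [5, 5, 5, 5],
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_times_2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with x
    "y": [4.0, 1.0, 3.0, 2.0],
})

# 1) Drop columns whose share of missing values reaches a threshold.
null_share = fm.isna().mean()
fm = fm.loc[:, null_share < 0.7]

# 2) Drop columns that hold a single distinct value.
fm = fm.loc[:, fm.nunique() > 1]

# 3) For each highly correlated pair, drop the later column.
corr = fm.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= 0.95).any()]
fm = fm.drop(columns=to_drop)

print(list(fm.columns))
```

In practice the library functions would be applied directly to the `feature_matrix` returned by `ft.dfs`, for example `ft.selection.remove_single_value_features(feature_matrix)`.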

<div class="alert alert-warning">

### **GeoPandas**

</div>

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">


Geographic coordinates or location-based features are often present in datasets used in data analysis competitions.  
To work with such data, the `GeoPandas` library can be used. It combines the functionality of `Pandas` with `Shapely`, a library for geospatial computations.
</div>

<div class="alert" style="background-color: #FEF9E7; border-left: 8px solid #D4AC0D; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

Let's load the California Housing dataset, which is used for house price prediction.
</div>

In [None]:
import geopandas as gpd
# from sklearn.datasets import fetch_california_housing

In [29]:
from classes import Paths
paths = Paths()
path_df = paths.cal_housing_data
path_df_domain = paths.cal_housing_domain

In [30]:
# Read the first 9 lines of the domain file; each line starts with a column name followed by a colon.
domain_lines = []
with open(path_df_domain, "rb") as f:
    for _ in range(9):
        domain_lines.append(f.readline().decode("utf-8").strip())

# Keep only the column name (the part before the colon).
cal_h_col_names = [line.split(":")[0] for line in domain_lines]
cal_h_col_names

['longitude',
 'latitude',
 'housingMedianAge',
 'totalRooms',
 'totalBedrooms',
 'population',
 'households',
 'medianIncome',
 'medianHouseValue']

In [36]:
df = pd.read_csv(path_df, names=cal_h_col_names)
print(df.shape)
df.head(10)

(20640, 9)


Unnamed: 0,longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
5,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0
6,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0
7,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0
8,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0
9,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0


<div class="alert" style="background-color: #FEF9E7; border-left: 8px solid #D4AC0D; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

We are interested in the `longitude` and `latitude` columns.

</div>

In [37]:
# Build a GeoDataFrame from the lon/lat columns (EPSG:4326) and reproject it to Web Mercator (EPSG:3857)
gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df['longitude'], df['latitude']),
        crs=4326
    ).to_crs(epsg=3857)
gdf.head(3)

Unnamed: 0,longitude,latitude,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue,geometry
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,POINT (-13606581.36 4562487.679)
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,POINT (-13605468.165 4559667.342)
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,POINT (-13607694.555 4558257.461)

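The reprojection from EPSG:4326 (longitude/latitude in degrees) to EPSG:3857 (Web Mercator, in metres) can be reproduced by hand with the spherical Mercator formulas. The helper below is a hypothetical function written for this illustration, not part of GeoPandas. Applied to the first row, it should agree with the `POINT` shown above to well within a metre.

```python
import numpy as np

R = 6378137.0  # sphere radius used by Web Mercator, in metres


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to spherical Web Mercator x/y in metres."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    x = R * lon
    y = R * np.log(np.tan(np.pi / 4 + lat / 2))
    return x, y


# First row of the dataset: longitude -122.23, latitude 37.88.
x, y = lonlat_to_web_mercator(-122.23, 37.88)
print(round(float(x), 2), round(float(y), 2))
```

One caveat worth keeping in mind: Web Mercator stretches distances increasingly with latitude, so distances computed in these coordinates are only approximate.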

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

As you can see, it is not significantly different from a regular DataFrame. The only distinction is the `geometry` column, which is a `GeoSeries` object. This object provides additional attributes and methods specific to geospatial data.
</div>

In [43]:
type(gdf['geometry'])

geopandas.geoseries.GeoSeries

In [44]:
gpd.geoseries.GeoSeries?

Init signature:
gpd.geoseries.GeoSeries(
    data=None,
    index=None,
    crs: 'Optional[Any]' = None,
    **kwargs,
)
Docstring:
A Series object designed to store shapely geometry objects.

Parameters
----------
data : array-like, dict, scalar value
    The geometries to store in the GeoSeries.
index : array-like or Index
    The index for the GeoSeries.
crs : value (optional)
    Coordinate Reference System of the geometry objects. Can be anything accepted by
    :meth:`pyproj.CRS.from_user_input() <pyproj.crs.CRS.from_user_input>`,
    such as an authority string (eg "EPSG:4326") or

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

Inside the `geometry` column are `Point` objects from the `shapely` library.
</div>

In [45]:
type(gdf['geometry'][0])

shapely.geometry.point.Point

In [47]:
import shapely
shapely.geometry.point.Point?

Init signature: shapely.geometry.point.Point(*args)
Docstring:
A geometry type that represents a single coordinate.

Each coordinate has x, y and possibly z and/or m values.

A point is a zero-dimensional feature and has zero length and zero area.

Parameters
----------
args : float, or sequence of floats
    The coordinates can either be passed as a single parameter, or as
    individual float values using multiple parameters:

    1) 1 parameter: a sequence or array-like of with 2 or 3 values.
    2) 2 or 3 parameters (float): x, y, and possibly z.

Attributes
----------
x, y, z, m : float
    Coordinate values

Examples
--------
Constructing the Point using separate parameters for x and y:

>>> from shapely import Point
>>> p = Point(1.0, -1.0)

Constructing the Point using a list of x, y coordinates:

>>> p = Point([1.0, -1.0

<div class="alert" style="background-color:  #E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

It’s worth noting that GeoPandas supports not only points, but also linestrings, polygons, and mixed geometry types.

For more details, see: https://shapely.readthedocs.io/en/stable/manual.html

</div>
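As a small illustration of the other geometry types, the sketch below builds a point, a line, and a polygon with `shapely` and queries a few basic properties. The coordinates are arbitrary toy values in a flat plane, not real locations.

```python
from shapely.geometry import LineString, Point, Polygon

point = Point(1.0, 1.0)
line = LineString([(0.0, 0.0), (4.0, 0.0)])
square = Polygon([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])

print(line.length)             # length of the line
print(square.area)             # area of the polygon
print(square.contains(point))  # point-in-polygon test
print(point.distance(line))    # shortest distance from the point to the line
```

Quantities like these (distance to a road, membership in a district polygon) are typical raw material for geospatial features.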
