# Pandas Methods Via Cars

These are examples of methods of [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and some methods of [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). Later, the more complex transformations also use some [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object methods.

If you learn these, that's great!

If you learn how to read the docs and pick up new methods yourself, that's even better! The pandas docs are generally good. Sometimes a bit too [long and convoluted](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html), but good.

It is also worth noting that pandas is built on [numpy](https://docs.scipy.org/doc/numpy/reference/), which is mostly nice and fast, but it is not covered here.

## Reading/Writing Data

We collect our own sample data for these exercises, a dataframe of cars.

It's stored in parquet initially, but other [serializations](https://en.wikipedia.org/wiki/Serialization) are possible.

Generally, the factors for choosing a format are:
- is it human readable? (e.g. csv is, pickle is not)
- is it fast to read/write? (e.g. pickle is, csv is not)
- what other programs might need it later? (csv is good for almost anything, xlsx for one thing, feather for nothing)

pandas provides a plethora of reading and writing methods for dataframes (tables) ([from here on](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html)).

Parquet is small and quick, ideal for self-hosting, but needs another package (like `pyarrow`) to work.
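The trade-offs above can be tried on a toy frame. This is a minimal sketch, assuming a made-up three-row table (not the real cars data) and a temporary directory in place of real storage. Parquet is left out so the snippet runs without `pyarrow`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, not the real cars dataset
toy = pd.DataFrame(
    {
        "brand": ["opel", "bmw", "lada"],
        "horsepower": [101.0, 190.0, 65.0],
        "year": [2003, 2018, 1989],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    pkl_path = Path(tmp) / "toy.pkl"

    # csv: plain text that a human or almost any program can open
    toy.to_csv(csv_path, index=False)
    csv_text = csv_path.read_text(encoding="utf-8")
    from_csv = pd.read_csv(csv_path)

    # pickle: opaque bytes, mostly useful for Python-to-Python round trips
    toy.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

print(csv_text)
print(from_pickle.equals(toy))
```

The csv text can be read by eye, while the pickle file cannot, but the pickle round trip returns the frame exactly as it was written.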

In [1]:
import pandas as pd

In [2]:
# reading can be done from local storage, but also over http or even from s3 or gs; read_sql reads from SQL hosts
df = pd.read_parquet(
    "https://borza-public-data.s3.eu-central-1.amazonaws.com/ha-data/ha-2019-05-18.parquet"
)  # or 20-08-01

## Exploring Data
- [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) (or [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) or even better, [sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html#pandas.DataFrame.sample) which has the very useful frac parameter)
- [T](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas.DataFrame.T) for transpose
- [isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html#pandas.DataFrame.isna) or [isnull](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html#pandas.DataFrame.isnull)
- [mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean)
- [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_values.html#pandas.Series.sort_values) used for a Series here
- [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html#pandas.Series.value_counts)
- [dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes)
- [nunique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html#pandas.DataFrame.nunique)
- describe and other aggregations

In [3]:
# first n (now 10) rows
df.head(10)

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype,HIFI,FŰTHETŐ SZÉLVÉDŐ,...,KÖDLÁMPA,ASR,FŰTHETŐ TÜKÖR,LÉGZSÁK,ESP,ALUFELNI,KLÍMA,CD-RÁDIÓ,ÁLLÍTHATÓ KORMÁNY,CENTRÁLZÁR
0,4355000.0,citroen,82.0,1199.0,Benzin,,2018.0,c3,0,0,...,1,1,1,1,1,1,1,0,1,1
1,4710000.0,citroen,82.0,1199.0,Benzin,5.0,2018.0,c3_aircross,1,0,...,1,1,1,1,1,0,1,0,1,1
2,13490000.0,bmw,190.0,1995.0,Dízel,5.0,2018.0,x2,0,0,...,0,1,0,0,1,1,0,1,1,1
3,11990000.0,bmw,190.0,1995.0,Dízel,,2017.0,420,0,0,...,0,1,0,0,1,1,0,1,1,1
4,16990000.0,bmw,265.0,2993.0,Dízel,3700.0,2017.0,530,0,0,...,0,1,0,0,1,1,0,1,1,1
5,26650000.0,bmw,449.0,4395.0,Benzin,2600.0,2017.0,650,0,0,...,0,1,0,0,1,1,0,1,1,1
6,5215000.0,citroen,110.0,1199.0,Benzin,5.0,2019.0,berlingo,1,0,...,1,1,0,1,1,0,1,0,0,0
7,6155000.0,citroen,131.0,1199.0,Benzin,5.0,2019.0,grand_c4_spacetourer,1,0,...,1,1,0,0,1,0,1,0,1,1
8,4590000.0,citroen,82.0,1199.0,Benzin,,2019.0,c4_cactus,1,0,...,0,1,1,0,1,0,1,0,1,1
9,9990000.0,bmw,150.0,1995.0,Dízel,5.0,2018.0,318,0,0,...,0,1,0,0,1,1,0,1,1,1


In [4]:
# you might be tempted to find a way to increase the maximum number of columns shown.
# don't do that. transpose the DataFrame (head) like this
# last n rows (3) transposed
df.tail(3).T

Unnamed: 0,86860,86861,86862
parsed_price,690.0,2000.0,350.0
brand,ford,opel,toyota
horsepower,116.0,110.0,75.0
cc,1998.0,1598.0,1332.0
fuel,Dízel,Dízel,Benzin
kms,266000.0,200.0,86000.0
year,2003.0,2013.0,1998.0
cartype,mondeo,astra_j,starlet
HIFI,1,0,0
FŰTHETŐ SZÉLVÉDŐ,1,0,0


In [5]:
# 1
df.sample(20)

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype,HIFI,FŰTHETŐ SZÉLVÉDŐ,...,KÖDLÁMPA,ASR,FŰTHETŐ TÜKÖR,LÉGZSÁK,ESP,ALUFELNI,KLÍMA,CD-RÁDIÓ,ÁLLÍTHATÓ KORMÁNY,CENTRÁLZÁR
7910,330000.0,fiat,103.0,1581.0,Benzin,162000.0,2000.0,brava,0,0,...,0,0,0,0,0,0,1,0,1,0
35387,1290000.0,citroen,68.0,998.0,Benzin,84300.0,2014.0,c1,0,0,...,0,1,0,1,1,0,0,1,1,1
58235,2567000.0,honda,139.0,1799.0,Benzin,157972.0,2010.0,civic,0,0,...,1,1,1,0,1,1,1,1,1,1
13105,549000.0,fiat,60.0,1242.0,Benzin,121700.0,2002.0,punto,0,0,...,0,0,0,1,0,0,0,0,1,0
32846,1190000.0,mitsubishi,95.0,1332.0,Benzin,154000.0,2009.0,colt,0,0,...,0,0,0,1,0,0,1,1,1,1
55177,2299000.0,volvo,163.0,2401.0,Dízel,125000.0,2009.0,s60,0,0,...,0,1,0,0,1,1,0,1,1,1
57602,2499000.0,opel,140.0,1796.0,Benzin,124000.0,2010.0,zafira_b,1,0,...,1,1,0,0,1,1,1,1,1,1
25960,925000.0,toyota,90.0,1995.0,Dízel,231564.0,2004.0,corolla,0,0,...,1,0,0,1,0,0,1,1,1,1
35665,1290000.0,renault,106.0,1461.0,Dízel,206700.0,2010.0,scenic,0,0,...,0,1,0,1,1,0,1,1,1,1
4833,175000.0,lada,65.0,1288.0,Benzin,162000.0,1989.0,samara,0,0,...,0,0,0,0,0,0,0,0,0,0
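The `frac` parameter of `sample` mentioned in the list above takes a fraction of the rows instead of a fixed count. A small sketch on a hypothetical 100-row frame; `random_state` is set only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical frame with 100 rows
toy = pd.DataFrame({"x": range(100)})

# a random 10% of the rows
tenth = toy.sample(frac=0.1, random_state=0)
print(len(tenth))  # 10

# frac=1 returns every row in random order, i.e. a shuffle
shuffled = toy.sample(frac=1, random_state=0)
print(len(shuffled))  # 100
```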


In [6]:
# 3
df.isna().mean().sort_values()

parsed_price         0.000000
CD-RÁDIÓ             0.000000
KLÍMA                0.000000
ALUFELNI             0.000000
ESP                  0.000000
LÉGZSÁK              0.000000
FŰTHETŐ TÜKÖR        0.000000
ASR                  0.000000
KÖDLÁMPA             0.000000
ISOFIX               0.000000
BLUETOOTH            0.000000
FŰTHETŐ ÜLÉS         0.000000
AUTOMATA             0.000000
PÓTKERÉK             0.000000
BŐR                  0.000000
SZÍNEZETT ÜVEG       0.000000
GPS                  0.000000
FŰTHETŐ SZÉLVÉDŐ     0.000000
HIFI                 0.000000
cartype              0.000000
year                 0.000000
cc                   0.000000
horsepower           0.000000
brand                0.000000
ÁLLÍTHATÓ KORMÁNY    0.000000
CENTRÁLZÁR           0.000000
fuel                 0.001543
kms                  0.007759
dtype: float64

In [7]:
# always worth remembering this when dealing with nans
_A = float("nan")
_A == _A

False
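Because a NaN never equals anything, comparing with `==` cannot find missing values. A short sketch of the checks that do work, using a hypothetical `kms` Series rather than the real column:

```python
import math

import pandas as pd

nan = float("nan")

# dedicated checks instead of ==
print(math.isnan(nan))  # True
print(pd.isna(nan))  # True

# the same applies when filtering a Series (hypothetical values)
kms = pd.Series([5.0, nan, 3700.0])
print((kms == nan).sum())  # 0, equality matches nothing
print(kms.isna().sum())  # 1, isna finds the gap
```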

In [8]:
# 2
df.dtypes.value_counts()

int64      20
float64     5
object      3
dtype: int64

In [9]:
# 1
df.nunique()

parsed_price          5381
brand                  111
horsepower             383
cc                     843
fuel                    10
kms                  29759
year                    66
cartype               1376
HIFI                     2
FŰTHETŐ SZÉLVÉDŐ         2
GPS                      2
SZÍNEZETT ÜVEG           2
BŐR                      2
PÓTKERÉK                 2
AUTOMATA                 2
FŰTHETŐ ÜLÉS             2
BLUETOOTH                2
ISOFIX                   2
KÖDLÁMPA                 2
ASR                      2
FŰTHETŐ TÜKÖR            2
LÉGZSÁK                  2
ESP                      2
ALUFELNI                 2
KLÍMA                    2
CD-RÁDIÓ                 2
ÁLLÍTHATÓ KORMÁNY        2
CENTRÁLZÁR               2
dtype: int64

In [10]:
# 1
df.describe()

Unnamed: 0,parsed_price,horsepower,cc,kms,year,HIFI,FŰTHETŐ SZÉLVÉDŐ,GPS,SZÍNEZETT ÜVEG,BŐR,...,KÖDLÁMPA,ASR,FŰTHETŐ TÜKÖR,LÉGZSÁK,ESP,ALUFELNI,KLÍMA,CD-RÁDIÓ,ÁLLÍTHATÓ KORMÁNY,CENTRÁLZÁR
count,86863.0,86863.0,86863.0,86189.0,86863.0,86863.0,86863.0,86863.0,86863.0,86863.0,...,86863.0,86863.0,86863.0,86863.0,86863.0,86863.0,86863.0,86863.0,86863.0,86863.0
mean,3395222.0,139.193753,1887.455925,171578.4,2007.862749,0.131529,0.160114,0.167724,0.19198,0.20484,...,0.412166,0.524285,0.525609,0.531331,0.564187,0.593993,0.674464,0.705479,0.814455,0.865305
std,5322451.0,310.636856,684.939878,184575.7,6.451738,0.33798,0.366714,0.373623,0.39386,0.403587,...,0.492228,0.499413,0.499347,0.49902,0.495866,0.491089,0.468577,0.45583,0.388741,0.341399
min,1.0,5.0,1.0,1.0,1926.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,898000.0,95.0,1461.0,108823.0,2004.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
50%,1700000.0,120.0,1796.0,170000.0,2008.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,3499000.0,156.0,1998.0,228000.0,2013.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,324000000.0,89103.0,20943.0,16777220.0,2019.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Filtering Data

- [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
- [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) and [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc)
- [slicing](https://stackoverflow.com/questions/509211/understanding-slice-notation)
- [isin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas.Series.isin)
- [str](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html#pandas.Series.str) accessor

In [11]:
# 1
df.loc[:, ["brand", "horsepower"]]

Unnamed: 0,brand,horsepower
0,citroen,82.0
1,citroen,82.0
2,bmw,190.0
3,bmw,190.0
4,bmw,265.0
...,...,...
86858,opel,147.0
86859,citroen,92.0
86860,ford,116.0
86861,opel,110.0


In [12]:
df.loc[df["fuel"] == "Benzin", ["kms", "parsed_price"]]

Unnamed: 0,kms,parsed_price
0,,4355000.0
1,5.0,4710000.0
5,2600.0,26650000.0
6,5.0,5215000.0
7,5.0,6155000.0
...,...,...
86855,3000.0,315990000.0
86856,14000.0,324000000.0
86857,194000.0,1.0
86858,222000.0,68.0


In [13]:
# functions can be used in loc, and can be useful with long pipelines
# - look into lambda functions
df.loc[lambda _df: _df["cartype"].str.contains("astra"), df.dtypes == object]

Unnamed: 0,brand,fuel,cartype
29,opel,Benzin,astra_k
30,opel,Benzin,astra_k
32,opel,Benzin,astra_k
33,opel,Benzin,astra_k
34,opel,Benzin,astra_k
...,...,...,...
77311,opel,Benzin,astra_k
77961,opel,Benzin,astra_k
86857,opel,Benzin,astra_f
86858,opel,Benzin,astra_g


In [14]:
# 2
df.loc[df["brand"] == "bmw", ["parsed_price", "cartype", "brand"]].drop("brand", axis=1)

Unnamed: 0,parsed_price,cartype
2,13490000.0,x2
3,11990000.0,420
4,16990000.0,530
5,26650000.0,650
9,9990000.0,318
...,...,...
86801,50090000.0,850
86804,50390000.0,m7
86807,51190000.0,850
86811,51990000.0,m5


In [15]:
# 1
df.iloc[3, 4]

'Dízel'

In [16]:
# 1
df.iloc[3, -5]

1

In [17]:
# 1
df.iloc[500:503, 3:10]

Unnamed: 0,cc,fuel,kms,year,cartype,HIFI,FŰTHETŐ SZÉLVÉDŐ
500,1298.0,Benzin,44000.0,2002.0,swift,0,0
501,1298.0,Benzin,181000.0,1999.0,swift,0,0
502,1298.0,Benzin,160000.0,2000.0,swift,0,1


In [18]:
# 1
df.loc[2:10, "brand"]

2         bmw
3         bmw
4         bmw
5         bmw
6     citroen
7     citroen
8     citroen
9         bmw
10        kia
Name: brand, dtype: object
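Note that the label slice `2:10` above includes row 10, while positional slices with `iloc` exclude the end, as in plain Python. A small sketch on a hypothetical Series with a default integer index:

```python
import pandas as pd

# Hypothetical Series, labels are 0..5
s = pd.Series(list("abcdef"))

by_label = s.loc[2:4]  # labels 2, 3 and 4: the end label is included
by_position = s.iloc[2:4]  # positions 2 and 3: the end position is excluded

print(by_label.tolist())  # ['c', 'd', 'e']
print(by_position.tolist())  # ['c', 'd']
```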

In [19]:
# bit more complex
df.loc[df["horsepower"].isin(df["horsepower"].nlargest(5)), df.nunique() > 2]

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype
4342,140000.0,lada,1360.0,1200.0,Benzin,150000.0,1981.0,2101
32899,1190000.0,opel,89103.0,1400.0,Benzin,80000.0,2005.0,tigra
75863,5890000.0,opel,2173.0,1598.0,Dízel,182745.0,2016.0,vivaro
84972,19880000.0,audi,993.0,3993.0,Benzin,135000.0,2013.0,rs6
86677,38990000.0,nissan,1196.0,3799.0,Benzin,40000.0,2015.0,gt-r


In [20]:
# 4
df.dropna().loc[lambda _df: _df["fuel"] == "Dízel"].sort_values(
    "horsepower", ascending=False
).head(25)

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype,HIFI,FŰTHETŐ SZÉLVÉDŐ,...,KÖDLÁMPA,ASR,FŰTHETŐ TÜKÖR,LÉGZSÁK,ESP,ALUFELNI,KLÍMA,CD-RÁDIÓ,ÁLLÍTHATÓ KORMÁNY,CENTRÁLZÁR
75863,5890000.0,opel,2173.0,1598.0,Dízel,182745.0,2016.0,vivaro,0,0,...,1,1,1,1,1,0,1,1,1,1
82192,11900000.0,bmw,560.0,4395.0,Dízel,130000.0,2012.0,m5,0,0,...,0,0,0,0,1,1,0,1,1,1
81117,9999000.0,audi,500.0,5934.0,Dízel,115666.0,2010.0,q7,0,0,...,0,1,1,0,1,1,0,1,1,1
82838,13296900.0,audi,500.0,5934.0,Dízel,79007.0,2010.0,q7,0,1,...,0,1,1,0,1,1,0,1,1,1
81129,9999000.0,ford,455.0,6700.0,Dízel,156000.0,2012.0,f_250,1,0,...,1,1,1,1,1,0,1,0,1,1
85533,22486000.0,audi,435.0,3956.0,Dízel,57000.0,2016.0,q7,0,0,...,0,1,0,0,1,1,0,1,1,1
85754,23990000.0,audi,435.0,3956.0,Dízel,58291.0,2017.0,q7,0,0,...,0,1,1,0,1,1,0,1,1,1
86027,26490000.0,audi,435.0,3956.0,Dízel,59570.0,2016.0,q7,0,0,...,0,1,1,0,1,1,0,1,1,1
85676,23490000.0,audi,435.0,3956.0,Dízel,39000.0,2017.0,q7,0,1,...,0,0,1,0,0,1,0,0,1,1
86000,26190000.0,audi,435.0,3956.0,Dízel,45900.0,2017.0,q7,0,0,...,0,1,1,0,1,1,0,1,1,1


In [21]:
# 5
df.loc[
    df["brand"].str.startswith("b") & (df[["GPS", "KLÍMA"]].sum(axis=1) > 1), :"cartype"
].dropna()

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype
67,6499000.0,bmw,190.0,1995.0,Dízel,130690.0,2014.0,520
83,6990000.0,bmw,136.0,1499.0,Benzin,26479.0,2017.0,118
109,7990000.0,bmw,150.0,1995.0,Dízel,6000.0,2016.0,218
134,8499000.0,bmw,140.0,1499.0,Benzin,6001.0,2018.0,218
149,8990000.0,bmw,190.0,1995.0,Dízel,18700.0,2018.0,320
...,...,...,...,...,...,...,...,...
86717,41390000.0,bmw,530.0,2993.0,Benzin,6100.0,2019.0,840
86731,42190000.0,bmw,530.0,4395.0,Benzin,1100.0,2019.0,850
86762,44290000.0,bmw,530.0,2993.0,Benzin,6100.0,2019.0,840
86814,52890000.0,bentley,642.0,3993.0,Benzin,18100.0,2017.0,continental


In [22]:
# 5
df.loc[df["year"] > 2005, :].loc[
    lambda _df: _df["kms"] > df["kms"].median(), ["brand", "parsed_price"]
].dropna()

Unnamed: 0,brand,parsed_price
14,volkswagen,2890000.0
15,ford,3495000.0
43,volvo,4299000.0
45,bmw,4790000.0
48,volvo,4990000.0
...,...,...
84649,mercedes-benz,18500000.0
84710,mercedes-benz,18800000.0
85001,mercedes-amg,19900000.0
85009,volvo,19900000.0


In [23]:
# 3
df.loc[:, df.nunique() > 2].sort_values("horsepower", ascending=False).head(5)

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype
32899,1190000.0,opel,89103.0,1400.0,Benzin,80000.0,2005.0,tigra
75863,5890000.0,opel,2173.0,1598.0,Dízel,182745.0,2016.0,vivaro
4342,140000.0,lada,1360.0,1200.0,Benzin,150000.0,1981.0,2101
86677,38990000.0,nissan,1196.0,3799.0,Benzin,40000.0,2015.0,gt-r
84972,19880000.0,audi,993.0,3993.0,Benzin,135000.0,2013.0,rs6


## Piped Transformations of Data

(and a little more exploring)

- pipe
- [apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) (for both DataFrame and Series)
- groupby
- assign
- pivot_table
- melt
- agg
- cut / qcut
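
Two entries in the list above are easier to see on toy data first: `melt`, which is roughly the inverse of `pivot_table`, and the difference between `cut` and `qcut`. This is a sketch with made-up values, not the cars data.

```python
import pandas as pd

# Hypothetical wide table: one row per brand, one price column per fuel
wide = pd.DataFrame(
    {
        "brand": ["opel", "bmw"],
        "Benzin": [1.2, 2.35],
        "Dízel": [1.5, 3.69],
    }
)

# melt: wide -> long, the fuel columns become (fuel, price) rows
long = wide.melt(id_vars="brand", var_name="fuel", value_name="price")
print(long)

# pivot_table: long -> wide again
back = long.pivot_table(index="brand", columns="fuel", values="price")
print(back)

# Hypothetical horsepower values
hp = pd.Series([50, 75, 90, 110, 150, 200, 300, 600])

# cut: the bin edges are given, so bin sizes can differ
fixed = pd.cut(hp, [0, 100, 200, 1000])
print(fixed.value_counts().sort_index().tolist())  # [3, 3, 2]

# qcut: the edges are quantiles, so bins hold roughly equal counts
by_quantile = pd.qcut(hp, 4)
print(by_quantile.value_counts().sort_index().tolist())  # [2, 2, 2, 2]
```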


In [24]:
# apply exists, but it is generally slow: avoid it for row-wise or element-wise operations when a built-in method does the job

In [25]:
# 2
df["cartype"].apply(lambda s: s.replace("_", " ").title())

0                 C3
1        C3 Aircross
2                 X2
3                420
4                530
            ...     
86858        Astra G
86859             C3
86860         Mondeo
86861        Astra J
86862        Starlet
Name: cartype, Length: 86863, dtype: object
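
The same cleanup can be written with the `str` accessor instead of `apply`. For plain string columns this is not necessarily faster, but it propagates missing values instead of raising on them. A sketch on a hypothetical handful of `cartype` values:

```python
import pandas as pd

# Hypothetical sample of cartype values
cartype = pd.Series(["c3_aircross", "astra_g", "grand_c4_spacetourer"])

with_apply = cartype.apply(lambda s: s.replace("_", " ").title())
with_str = cartype.str.replace("_", " ", regex=False).str.title()

print(with_str.tolist())
print(with_apply.tolist() == with_str.tolist())  # True

# a missing value stays missing with the accessor;
# the lambda above would fail on it, since None has no replace method
with_gap = pd.Series(["astra_g", None])
cleaned = with_gap.str.replace("_", " ", regex=False).str.title()
print(cleaned.isna().tolist())  # [False, True]
```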

In [26]:
# 5
df.loc[
    df["brand"].pipe(lambda s: s.isin(s.value_counts().head(3).index)), :"cartype"
].dropna()

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype
2,13490000.0,bmw,190.0,1995.0,Dízel,5.0,2018.0,x2
4,16990000.0,bmw,265.0,2993.0,Dízel,3700.0,2017.0,530
5,26650000.0,bmw,449.0,4395.0,Benzin,2600.0,2017.0,650
9,9990000.0,bmw,150.0,1995.0,Dízel,5.0,2018.0,318
14,2890000.0,volkswagen,140.0,1968.0,Dízel,177300.0,2009.0,tiguan
...,...,...,...,...,...,...,...,...
86811,51990000.0,bmw,600.0,4395.0,Benzin,10100.0,2018.0,m5
86839,61690000.0,bmw,496.0,2979.0,Benzin,400.0,2017.0,m4
86857,1.0,opel,101.0,1598.0,Benzin,194000.0,2000.0,astra_f
86858,68.0,opel,147.0,2198.0,Benzin,222000.0,2003.0,astra_g


In [27]:
# 6
(
    df.assign(c=1)
    .groupby("brand")
    .agg(
        {
            "parsed_price": "mean",
            "horsepower": "median",
            "year": "min",
            "cartype": "nunique",
            "c": "sum",
            "GPS": "mean",
        }
    )
    .sort_values("c", ascending=False)
    .head(10)
)

Unnamed: 0_level_0,parsed_price,horsepower,year,cartype,c,GPS
brand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
volkswagen,2400272.0,110.0,1955.0,77,8323,0.164604
opel,1621438.0,101.0,1965.0,62,8076,0.096211
bmw,6283884.0,184.0,1969.0,96,7772,0.303525
ford,2362741.0,116.0,1926.0,48,7246,0.106128
mercedes-benz,6328044.0,170.0,1958.0,229,6180,0.267314
audi,6221048.0,179.0,1973.0,43,5485,0.334913
renault,1378619.0,106.0,1967.0,36,4329,0.152229
toyota,2569916.0,110.0,1981.0,41,3743,0.148544
skoda,2662310.0,105.0,1962.0,17,3157,0.128603
peugeot,1668080.0,109.0,1971.0,35,3025,0.149421
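
The helper column `c=1` combined with `"sum"` is one way to count rows per group. A possible alternative, sketched here on hypothetical toy data, is named aggregation with `"size"`, which avoids the extra column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "brand": ["opel", "opel", "bmw"],
        "parsed_price": [1.0, 3.0, 10.0],
    }
)

# the approach used above: add a column of ones and sum it
with_helper = (
    toy.assign(c=1)
    .groupby("brand")
    .agg({"parsed_price": "mean", "c": "sum"})
)

# named aggregation: output_name=(input_column, function)
with_size = toy.groupby("brand").agg(
    parsed_price=("parsed_price", "mean"),
    c=("parsed_price", "size"),
)

print(with_size)
print(with_helper["c"].tolist() == with_size["c"].tolist())  # True
```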


In [28]:
# 7
(
    df.assign(HP_BIN=pd.cut(df["horsepower"], [0, 60, 100, 140, 200, 300, 500, 1500]))
    .groupby("HP_BIN")
    .agg(
        {
            "parsed_price": ["min", "max", "median"],
            "brand": lambda ser: ser.value_counts().idxmax(),
            "cc": ["mean", "max"],
            "fuel": ["nunique", lambda ser: ser.value_counts().idxmin()],
        }
    )
)

Unnamed: 0_level_0,parsed_price,parsed_price,parsed_price,brand,cc,cc,fuel,fuel
Unnamed: 0_level_1,min,max,median,<lambda>,mean,max,nunique,<lambda_0>
HP_BIN,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
"(0, 60]",30000.0,9600000.0,389000.0,volkswagen,1135.702577,5599.0,5,CNG
"(60, 100]",350.0,11890000.0,950000.0,opel,1415.665369,17997.0,7,CNG
"(100, 140]",1.0,51250000.0,1599000.0,volkswagen,1746.479423,20000.0,8,Etanol
"(140, 200]",68.0,36550000.0,2799000.0,bmw,2095.199956,20943.0,10,Etanol
"(200, 300]",270000.0,49990000.0,5472500.0,bmw,2831.854791,19998.0,9,CNG
"(300, 500]",799000.0,64990000.0,13999999.0,bmw,3625.730703,7998.0,7,Etanol
"(500, 1500]",140000.0,324000000.0,27990000.0,bmw,4641.770492,8382.0,3,Dízel


In [29]:
# 8
df.groupby("fuel").apply(
    lambda df: df.groupby("brand")
    .agg({"parsed_price": "std", "horsepower": "std", "cc": "nunique"})
    .sort_values("cc", ascending=False)
    .reset_index()
    .iloc[0, :]
)

Unnamed: 0_level_0,brand,parsed_price,horsepower,cc
fuel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Benzin,ford,2686499.0,56.794172,89
Benzin/Gáz,mercedes-benz,1151447.0,68.960789,14
CNG,mercedes-benz,1911186.0,36.538647,6
Dízel,mercedes-benz,6610488.0,55.887551,59
Elektromos,opel,3818377.0,37.476659,2
Etanol,ford,271258.7,10.733126,2
Hibrid,toyota,2675782.0,41.49207,16
Hibrid (Benzin),toyota,2008481.0,27.127297,11
Hibrid (Dízel),mercedes-benz,7482852.0,12.094535,2
LPG,opel,824348.7,27.869048,7


In [30]:
# 8
(
    df.assign(XXI=lambda df: df["year"] > 1999)
    .assign(STRONG=lambda df: df["horsepower"] > df["horsepower"].median())
    .groupby(["XXI", "STRONG"])
    .apply(
        lambda df: df.assign(
            RATIO=lambda df: df["cc"].values / df["horsepower"].values
        )[["cc", "horsepower", "RATIO"]].median()
    )
    .pivot_table(index="STRONG", columns="XXI")
)

Unnamed: 0_level_0,RATIO,RATIO,cc,cc,horsepower,horsepower
XXI,False,True,False,True,False,True
STRONG,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
False,19.088235,15.583333,1587.0,1499.0,77.0,99.0
True,14.691176,12.828571,2387.0,1997.0,150.0,163.0


In [31]:
# 9
(
    df.assign(
        extras=df.loc[:, df.nunique() == 2].sum(axis=1),
        price_cat=pd.cut(df["parsed_price"] / 10 ** 6, [0.5, 1, 2.5, 7, 30]),
    )
    .groupby(["brand", "price_cat"])["extras"]
    .agg(["mean", "count"])
    .reset_index()
    .loc[lambda _df: _df["brand"].isin(df["brand"].value_counts().head(8).index)]
    .pivot_table(index="brand", columns="price_cat")
    .pipe(lambda df: df - df.mean())
)

Unnamed: 0_level_0,count,count,count,count,mean,mean,mean,mean
price_cat,"(0.5, 1.0]","(1.0, 2.5]","(2.5, 7.0]","(7.0, 30.0]","(0.5, 1.0]","(1.0, 2.5]","(2.5, 7.0]","(7.0, 30.0]"
brand,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
audi,-601.375,-805.625,140.125,669.625,0.683411,0.479337,0.313499,0.742986
bmw,-379.375,-1.625,694.125,1413.625,0.681392,0.308155,0.071912,0.429407
ford,448.625,492.375,284.125,-461.375,-0.024134,-0.265083,-0.567157,-0.906041
mercedes-benz,-493.375,-461.625,79.125,903.625,0.206845,0.100931,-0.029302,0.103392
opel,957.625,1056.375,-176.875,-755.375,-0.444595,-0.349494,-0.203868,-0.624541
renault,151.625,-717.625,-1052.875,-787.375,0.077637,0.044761,-0.058225,-0.132281
toyota,-639.375,-358.625,-606.875,-604.375,-0.870745,-0.40382,0.2398,-0.022328
volkswagen,555.625,796.375,639.125,-378.375,-0.309812,0.085213,0.23334,0.409406


In [32]:
# 8
(
    df.loc[df["brand"].isin(df["brand"].value_counts().head(20).index)]
    .groupby("brand")
    .apply(
        lambda _df: 1_000_000
        * (_df[["horsepower", "cc"]] / _df[["parsed_price"]].values).median()
    )
)

Unnamed: 0_level_0,horsepower,cc
brand,Unnamed: 1_level_1,Unnamed: 2_level_1
audi,51.089675,658.19398
bmw,57.346077,712.5
citroen,84.496124,1301.084237
fiat,100.0,1765.540891
ford,70.30303,1110.678532
honda,65.254237,963.309353
hyundai,50.574713,715.897436
kia,49.69641,695.929999
mazda,82.939387,1225.162758
mercedes-benz,56.209957,798.588056


In [33]:
# 9
(
    df.loc[
        df.loc[:, df.dtypes == float]
        .apply(lambda s: (s < s.quantile(0.95)) & (s >= 0))
        .all(axis=1)
        & df["brand"].isin(df["brand"].value_counts().head(10).index),
        :,
    ]
    .groupby("brand")
    .apply(
        lambda gdf: gdf[["year", "parsed_price", "kms", "cc"]]
        .corr()
        .unstack()
        .loc[lambda s: s.index.get_level_values(0) > s.index.get_level_values(1)]
    )
)

Unnamed: 0_level_0,year,year,year,parsed_price,parsed_price,kms
Unnamed: 0_level_1,parsed_price,kms,cc,kms,cc,cc
brand,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
audi,0.803029,-0.579585,0.122372,-0.676267,0.24767,0.140876
bmw,0.747074,-0.575142,-0.262947,-0.732438,-0.205673,0.311646
ford,0.777505,-0.321184,0.061015,-0.438965,0.289453,0.267097
mercedes-benz,0.689827,-0.372575,0.041856,-0.553653,0.253802,0.216523
opel,0.791465,-0.429441,0.050421,-0.513854,0.164731,0.255525
peugeot,0.810289,-0.365238,0.04355,-0.407607,0.197152,0.273378
renault,0.790212,-0.4199,-0.038678,-0.505316,0.046316,0.294715
skoda,0.734384,-0.313859,0.213336,-0.513528,0.369412,0.205511
toyota,0.74764,-0.497284,0.054031,-0.511506,0.343482,0.214234
volkswagen,0.711432,-0.357272,0.089714,-0.467986,0.308519,0.251282


In [34]:
# 7
(
    df.loc[(df["parsed_price"] < 15_000_000) & (df["year"] > 2000), :]
    .assign(base_type=df["cartype"].str.split("_").str[0], c=1)
    .groupby(["brand", "base_type"])
    .agg(
        {
            "fuel": "nunique",
            "cartype": "nunique",
            "parsed_price": "std",
            "c": "sum",
            "horsepower": "mean",
        }
    )
    .loc[lambda _df: _df["c"] >= 500, :]
    .sort_values("parsed_price", ascending=False)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,fuel,cartype,parsed_price,c,horsepower
brand,base_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
mercedes-benz,c,7,16,3809832.0,844,164.57346
audi,a6,4,2,3639698.0,1105,209.370136
mercedes-benz,e,7,16,3584664.0,958,191.209812
mercedes-benz,a,3,11,3470192.0,690,112.171014
audi,a4,3,3,2705634.0,1208,156.351821
bmw,320,3,2,2580014.0,1023,166.164223
audi,a3,5,2,2155293.0,673,129.451709
ford,mondeo,6,1,2134440.0,1138,139.096661
volkswagen,passat,5,9,2058075.0,1771,141.686053
skoda,octavia,4,1,1908430.0,1570,121.350318


In [35]:
# 9
df.groupby(pd.cut(df["year"], [1990, 2000, 2010, 2020])).apply(
    lambda gdf: pd.Series(
        {
            "efficiency": (gdf["horsepower"] / gdf["cc"]).median() * 1000,
            "price_rate": gdf.dropna(subset=["kms"])
            .groupby(gdf["kms"] > gdf["kms"].median())["parsed_price"]
            .median()
            .pipe(lambda s: s[True] / s[False]),
        }
    )
)

Unnamed: 0_level_0,efficiency,price_rate
year,Unnamed: 1_level_1,Unnamed: 2_level_1
"(1990, 2000]",56.320401,1.246753
"(2000, 2010]",66.345441,0.923018
"(2010, 2020]",78.080229,0.447619


In [36]:
# 7
df.loc[(df["year"] < 2015) & (df["year"] > 1990), :].assign(
    kmpy=lambda _df: (_df["kms"] / (2020 - _df["year"])).round(-3),
    kmpbin=lambda _df: pd.qcut(_df["kmpy"], 5),
).groupby("kmpbin")[["year", "cc", "horsepower", "parsed_price"]].median()

Unnamed: 0_level_0,year,cc,horsepower,parsed_price
kmpbin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"(-0.001, 10000.0]",2004.0,1493.0,94.0,949000.0
"(10000.0, 13000.0]",2005.0,1598.0,109.0,1150000.0
"(13000.0, 16000.0]",2006.0,1896.0,116.0,1350000.0
"(16000.0, 20000.0]",2008.0,1968.0,131.0,1690000.0
"(20000.0, 1118000.0]",2011.0,1968.0,140.0,2400000.0


In [37]:
# 4
df.groupby("cartype")[["cc", "parsed_price"]].transform("mean")

Unnamed: 0,cc,parsed_price
0,1314.481481,1.494045e+06
1,1299.615385,5.046231e+06
2,1956.280000,1.400830e+07
3,1995.651685,9.800442e+06
4,2908.927928,6.707052e+06
...,...,...
86858,1548.294118,6.791712e+05
86859,1314.481481,1.494045e+06
86860,1906.092809,2.302770e+06
86861,1561.512111,2.594038e+06
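
Unlike `agg`, which returns one row per group, `transform` returns one value per original row, which is why the result above has the same index as `df`. A sketch of the difference on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "brand": ["opel", "opel", "bmw"],
        "parsed_price": [1.0, 3.0, 10.0],
    }
)

grouped = toy.groupby("brand")["parsed_price"]

per_group = grouped.agg("mean")  # one value per brand
per_row = grouped.transform("mean")  # one value per original row

print(per_group.tolist())  # [10.0, 2.0]
print(per_row.tolist())  # [2.0, 2.0, 10.0]

# typical use: compare each row with its own group
compared = toy.assign(vs_brand=toy["parsed_price"] - per_row)
print(compared)
```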


In [38]:
# 4
df.loc[df.groupby("brand")["cartype"].transform("count") > 5_000, :].sample(30).iloc[
    :, :8
]

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype
8004,339000.0,opel,54.0,973.0,Benzin,183526.0,1998.0,corsa
20310,750000.0,volkswagen,90.0,1896.0,Dízel,242000.0,1998.0,passat_v
51942,2050000.0,audi,105.0,1896.0,Dízel,278000.0,2006.0,a3
16416,650000.0,bmw,116.0,1796.0,Benzin,211304.0,2001.0,316
70657,4099000.0,audi,120.0,1968.0,Dízel,190000.0,2013.0,a4
12625,519000.0,opel,101.0,1994.0,Dízel,367000.0,2004.0,vectra_c
23211,849000.0,volkswagen,105.0,1595.0,Benzin,229555.0,2002.0,golf_iv
38130,1390000.0,mercedes-benz,306.0,4966.0,Benzin,246000.0,1999.0,s_500
7395,230000.0,audi,125.0,1781.0,Benzin,220000.0,1996.0,a3
30429,1098000.0,volkswagen,102.0,1595.0,Benzin,229000.0,2004.0,touran


In [39]:
# 6
(
    df.groupby(["brand", "fuel"])["parsed_price"]
    .agg(["median", "count"])
    .loc[lambda _df: _df["count"] > 300]
    .reset_index()
    .pivot_table(index="brand", columns="fuel", values="median")
    .dropna()
    .assign(diff=lambda _df: _df["Benzin"] - _df["Dízel"])
    .sort_values("diff")
    .pipe(lambda _df: _df / 10 ** 6)
    .round(2)
)

fuel,Benzin,Dízel,diff
brand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
volvo,1.6,3.0,-1.4
bmw,2.35,3.69,-1.34
skoda,1.48,2.6,-1.12
mitsubishi,1.2,2.0,-0.8
volkswagen,1.2,1.95,-0.75
mercedes-benz,2.65,3.3,-0.65
citroen,0.95,1.5,-0.55
nissan,1.99,2.49,-0.5
peugeot,0.99,1.48,-0.49
ford,1.3,1.75,-0.45


In [40]:
# 7
(
    df.loc[
        (df.groupby(["brand", "fuel"])["cartype"].transform("count") > 500)
        & df["parsed_price"].pipe(lambda s: s < s.quantile(0.9))
        & (df["year"] > 1990)
    ]
    .groupby(["brand", "fuel"])
    .apply(lambda gdf: gdf[["parsed_price", "year"]].corr().iloc[0, 1])
    .reset_index()
    .pivot_table(index="brand", columns="fuel", values=0)
    .dropna()
    .assign(diff=lambda _df: _df["Benzin"] - _df["Dízel"])
    .sort_values("diff")
    .round(2)
)

fuel,Benzin,Dízel,diff
brand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
mercedes-benz,0.61,0.77,-0.16
bmw,0.68,0.81,-0.13
honda,0.81,0.9,-0.09
mazda,0.81,0.88,-0.08
audi,0.82,0.85,-0.02
peugeot,0.83,0.84,-0.01
seat,0.84,0.83,0.01
renault,0.84,0.82,0.02
citroen,0.88,0.85,0.03
volkswagen,0.83,0.8,0.03


In [41]:
# 6
(
    df.assign(extras=df.loc[:, "HIFI":].sum(axis=1)).loc[
        lambda _df: _df[["parsed_price", "cc", "horsepower", "kms", "extras"]]
        .apply(lambda s: s >= s.quantile(0.8))
        .all(axis=1),
        :"BŐR",
    ]
)

Unnamed: 0,parsed_price,brand,horsepower,cc,fuel,kms,year,cartype,HIFI,FŰTHETŐ SZÉLVÉDŐ,GPS,SZÍNEZETT ÜVEG,BŐR
3391,7500000.0,mercedes-benz,252.0,2987.0,Dízel,246050.0,2013.0,cls_350,1,0,1,0,1
70934,4190000.0,audi,239.0,2967.0,Dízel,267814.0,2008.0,q7,0,0,1,0,0
70962,4190000.0,jeep,177.0,2777.0,Dízel,271900.0,2007.0,wrangler,0,0,1,0,0
70983,4190000.0,mercedes-benz,231.0,2987.0,Dízel,285092.0,2011.0,e_300,0,0,1,0,0
70988,4190000.0,mercedes-benz,224.0,2987.0,Dízel,423000.0,2007.0,gl_320,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
79431,8299999.0,toyota,286.0,4461.0,Dízel,338060.0,2008.0,land_cruiser,1,0,1,0,1
79634,8490000.0,mercedes-benz,252.0,2987.0,Dízel,268000.0,2014.0,e_350,0,0,1,0,1
80526,9490000.0,mercedes-benz,387.0,5461.0,Benzin,250634.0,2009.0,cl_500,1,0,1,0,1
82245,12190000.0,audi,500.0,6299.0,Benzin,280223.0,2011.0,a8,0,0,0,0,1


In [42]:
# 9
(
    df.assign(c=1)
    .groupby(["brand", "year"])
    .agg({"parsed_price": "median", "c": "sum"})
    .loc[lambda _df: _df["c"] > 100]
    .groupby("brand")
    .apply(
        lambda gdf: gdf.sort_values("year").assign(
            pdiff=lambda _gdf: _gdf["parsed_price"].diff()
            / _gdf["parsed_price"].median(),
            ydiff=lambda _gdf: _gdf.reset_index()["year"].diff().values,
        )
    )
    .loc[lambda _df: _df["ydiff"] == 1]
    .pivot_table(columns="brand", index="year", values="pdiff")
    .loc[:, lambda _df: _df.notna().sum() > 10]
)

brand,audi,bmw,citroen,ford,mercedes-benz,opel,peugeot,renault,skoda,toyota,volkswagen
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1997.0,,,,,,,,,,,-0.005761
1998.0,,,,,,,,,,,0.038781
1999.0,,,,,-0.015813,0.088519,,,,,0.048848
2000.0,0.014007,0.036462,,0.058858,0.024203,0.022963,,,,,0.024257
2001.0,0.038169,0.01769,,0.058858,0.03695,0.042593,,0.020545,,,0.048514
2002.0,0.035192,0.018411,,0.02884,0.067931,0.082593,0.065891,0.071014,,,0.054579
2003.0,0.069859,0.06047,0.0,0.041789,0.096813,0.037037,0.024031,0.062975,,0.117016,0.066101
2004.0,0.066708,0.132671,0.114704,0.069158,0.070996,0.074815,0.077519,0.133095,0.021518,0.069464,0.167374
2005.0,0.091219,0.113357,0.076982,0.077987,0.008068,0.11111,0.116279,0.089326,0.114764,0.055944,0.075197
2006.0,0.105051,0.050542,0.076597,0.046498,0.130052,0.111112,0.097674,0.089326,0.042558,0.075058,0.105822


In [43]:
# 8
df.groupby(["brand", "year"]).agg(
    {
        "cc": "median",
        "horsepower": "median",
        "parsed_price": "median",
        "cartype": "count",
    }
).sort_values("cartype").tail(200).rank(ascending=False).assign(
    rdiff=lambda _df: _df["parsed_price"] - _df["horsepower"]
).sort_values(
    "rdiff", ascending=False
).head(
    10
)

Unnamed: 0_level_0,Unnamed: 1_level_0,cc,horsepower,parsed_price,cartype,rdiff
brand,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
mercedes-benz,2000.0,15.5,59.5,173.0,173.5,113.5
bmw,2001.0,9.0,44.5,154.0,105.0,109.5
mercedes-benz,2001.0,15.5,59.5,164.0,139.0,104.5
bmw,2000.0,50.0,52.0,156.0,176.0,104.0
bmw,2002.0,43.5,52.0,146.5,71.0,94.5
audi,2000.0,77.0,84.5,170.5,189.5,86.0
bmw,2003.0,43.5,52.0,137.0,84.5,85.0
audi,2001.0,77.0,73.5,158.0,194.5,84.5
mercedes-benz,2002.0,15.5,58.0,141.5,103.5,83.5
mercedes-benz,2003.0,15.5,44.5,119.5,79.0,75.0


In [44]:
# 8
df.assign(lsize=df["cc"].round(-3)).groupby(["lsize"]).agg(
    main_type=pd.NamedAgg("cartype", pd.Series.mode),
    top_pow=pd.NamedAgg("horsepower", lambda s: s.quantile(0.9)),
    hifi_rate=pd.NamedAgg("HIFI", "mean"),
    count=pd.NamedAgg("brand", "count")
).loc[lambda _df: _df["count"] > 500]

Unnamed: 0_level_0,main_type,top_pow,hifi_rate,count
lsize,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1000.0,swift,120.0,0.096183,25046
2000.0,focus,179.0,0.146218,51225
3000.0,a6,326.0,0.144583,8445
4000.0,a8,570.0,0.138205,1259
5000.0,s_500,557.0,0.150235,639
