We'll work with a dataset of used cars from *eBay Kleinanzeigen*, the classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to **Kaggle**. We've made a few modifications to the original:

* We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment
* We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

The data dictionary provided with the data is as follows:

* dateCrawled - When this ad was first crawled. All field-values are taken from this date.
* name - Name of the car.
* seller - Whether the seller is private or a dealer.
* offerType - The type of listing
* price - The price on the ad to sell the car.
* abtest - Whether the listing is included in an A/B test.
* vehicleType - The vehicle Type.
* yearOfRegistration - The year in which the car was first registered.
* gearbox - The transmission type.
* powerPS - The power of the car in PS.
* model - The car model name.
* kilometer - How many kilometers the car has driven.
* monthOfRegistration - The month in which the car was first registered.
* fuelType - What type of fuel the car uses.
* brand - The brand of the car.
* notRepairedDamage - If the car has a damage which is not yet repaired.
* dateCreated - The date on which the eBay listing was created.
* nrOfPictures - The number of pictures in the ad.
* postalCode - The postal code for the location of the vehicle.
* lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings. You'll also become familiar with some of the unique benefits Jupyter Notebook provides for pandas.

Let's start by importing the libraries we need and reading the dataset into pandas.

In [1]:
import numpy as np
import pandas as pd
print('libraries imported!')

libraries imported!


In [2]:
# import csv file
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

Several columns contain numbers but are stored as strings, for example 'price' and 'odometer'.

Some columns also have missing values, which we'll look at more closely.

* The dataset contains 20 columns, most of which are strings.
* Some columns have null values, but none have more than ~20% null values.
* The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
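
As a rough illustration of the camelcase problem, the sketch below converts names mechanically with a regular expression. The helper name `camel_to_snake` is hypothetical and not part of the original notebook; it only splits at lowercase-to-uppercase boundaries, so it cannot produce reworded names such as `registration_year`.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Insert an underscore at every boundary where a lowercase letter or
    # digit is followed by an uppercase letter, then lowercase everything.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

columns = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures', 'price']
converted = [camel_to_snake(c) for c in columns]
print(converted)
```

A mechanical conversion like this is a reasonable starting point, but a hand-written mapping is still needed wherever a more descriptive name is wanted.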

------

I will convert the column names from camelcase to snakecase and reword some of them, based on the data dictionary, to be more descriptive.

In [4]:
autos.rename(
        {'yearOfRegistration': 'registration_year',
        'monthOfRegistration': 'registration_month',
        'notRepairedDamage': 'unrepaired_damage',
        'dateCreated': 'ad_created',
        'vehicleType': 'vehicle_type',
        'lastSeen': 'last_seen',
        'postalCode': 'posta_code',
        'nrOfPictures': 'pictures_number',
        'fuelType': 'fuel_type',
        'powerPS': 'power_ps',
        'offerType': 'offer_type',
        'dateCrawled': 'date_crawled'}, axis=1, inplace=True
        )
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,pictures_number,posta_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


-------
We'll do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

* Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
* Examples of numeric data stored as text which can be cleaned and converted.

In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,pictures_number,posta_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


A couple of things to note:
* 'price' is stored as a string and needs to be converted to a number
* the same goes for 'odometer': we have to remove 'km' and store it as an int
* 'pictures_number' contains only 0 values and can be dropped
* 'seller' and 'offer_type' each have the same value in 49,999 of 50,000 rows, so most likely we'll drop them too
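
The check for near-constant columns can also be automated. The following is a minimal sketch on made-up data: the function name `near_constant_columns`, the 0.95 threshold, and the demo values are all assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """Return columns whose most frequent value covers at least
    `threshold` of the rows (hypothetical helper)."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most common value (NaN included).
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Made-up demo data: 100 rows, two near-constant columns and one varied column.
demo = pd.DataFrame({
    'seller': ['privat'] * 99 + ['other'],
    'pictures_number': [0] * 100,
    'brand': ['bmw', 'audi'] * 50,
})
flagged = near_constant_columns(demo)
print(flagged)
```

Columns flagged this way are only candidates for dropping; whether they carry useful information is still a judgment call.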

-----
We start by organizing 'price' and 'odometer'

In [6]:
print(autos.price.head())
print(autos.odometer.head())

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object


In [7]:
# remove non-numeric characters
# (regex=False treats '$' as a literal character, not as an end-of-string anchor)
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)

# convert the columns to a numeric type
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].astype(int)

# renaming odometer column
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)

In [8]:
autos.head(3)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,pictures_number,posta_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
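
Because 'price' and 'odometer' go through the same steps, the cleaning could be folded into one small helper. This is a sketch under assumptions: `to_int_column` is a hypothetical name, and the sample strings simply mirror the formats shown above.

```python
import pandas as pd

def to_int_column(series, junk):
    """Strip each literal substring in `junk` from a string Series,
    then convert the result to integers (hypothetical helper)."""
    cleaned = series
    for token in junk:
        # regex=False so that characters like '$' are matched literally.
        cleaned = cleaned.str.replace(token, '', regex=False)
    return cleaned.astype(int)

prices = pd.Series(['$5,000', '$8,500', '$0'])
odometer = pd.Series(['150,000km', '70,000km'])

clean_prices = to_int_column(prices, ['$', ','])
clean_km = to_int_column(odometer, ['km', ','])
print(clean_prices.tolist(), clean_km.tolist())
```

Note that `astype(int)` raises an error if any value still contains non-digit characters, which is useful here because it surfaces formats the cleaning did not anticipate.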


Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the 'odometer_km' and 'price' columns: count the unique values, view summary statistics, and inspect the value counts.

In [9]:
# to see how many unique values
print('number of unique values in price', autos['price'].unique().shape[0])
print('number of unique values in odometer_km', autos['odometer_km'].unique().shape[0])

number of unique values in price 2357
number of unique values in odometer_km 13


In [10]:
# to view min/max/median/mean etc
autos[['price', 'odometer_km']].describe()

Unnamed: 0,price,odometer_km
count,50000.0,50000.0
mean,9840.044,125732.7
std,481104.4,40042.211706
min,0.0,5000.0
25%,1100.0,125000.0
50%,2950.0,150000.0
75%,7200.0,150000.0
max,100000000.0,150000.0


In [11]:
# value counts
autos['price'].value_counts().sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [12]:
# value counts
autos['odometer_km'].value_counts().sort_index(ascending=True).head(20)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

In [13]:
# dropping outliers
autos = autos[autos['price'] < 3990000]

I dropped every listing priced at 3,990,000 or more (8 rows in total). These include:
* 12345678, which looks like a randomly entered number
* 99999999, which also looks randomly entered and is a huge outlier
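
Before dropping rows it can help to count how many a threshold would remove. The sketch below uses a handful of made-up prices together with the same 3,990,000 cutoff; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up sample of prices, including two implausible values.
prices = pd.Series([0, 1350, 5000, 350000, 12345678, 99999999])

upper = 3_990_000          # same cutoff as in the cell above
keep = prices < upper      # boolean mask of rows to keep

removed = int((~keep).sum())   # number of rows the filter would drop
kept = prices[keep]
print(removed, kept.tolist())
```

Checking `removed` first makes it easy to confirm that a filter only discards a small number of rows.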

--------
Let's now move on to the date columns and understand the date range the data covers.

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:

- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website


Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [14]:
autos[['date_crawled','ad_created','last_seen']].head()

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [15]:
autos[['date_crawled','ad_created','last_seen']].describe()

Unnamed: 0,date_crawled,ad_created,last_seen
count,49992,49992,49992
unique,48206,76,39477
top,2016-03-12 16:06:22,2016-04-03 00:00:00,2016-04-07 06:17:27
freq,3,1946,8


You'll notice that the first 10 characters represent the day (e.g. 2016-03-12). To understand the date range, we can extract just the date values, use `.value_counts()` to generate a distribution, and then sort by the index.

In [16]:
print(autos['date_crawled'].head())

0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object


In [17]:
autos['date_crawled_nt'] = (autos['date_crawled'].str.split().str[0])

In [18]:
autos['date_crawled_nt'].value_counts(
    normalize=True, dropna=False).sort_index()

2016-03-05    0.025384
2016-03-06    0.013942
2016-03-07    0.035966
2016-03-08    0.033265
2016-03-09    0.033205
2016-03-10    0.032125
2016-03-11    0.032485
2016-03-12    0.036766
2016-03-13    0.015562
2016-03-14    0.036626
2016-03-15    0.033985
2016-03-16    0.029505
2016-03-17    0.031525
2016-03-18    0.013062
2016-03-19    0.034906
2016-03-20    0.037826
2016-03-21    0.037506
2016-03-22    0.032925
2016-03-23    0.032385
2016-03-24    0.029105
2016-03-25    0.031745
2016-03-26    0.032485
2016-03-27    0.031045
2016-03-28    0.034846
2016-03-29    0.034165
2016-03-30    0.033625
2016-03-31    0.031905
2016-04-01    0.033805
2016-04-02    0.035406
2016-04-03    0.038686
2016-04-04    0.036526
2016-04-05    0.013102
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled_nt, dtype: float64
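
The date part can be extracted in more than one way. As a sketch on three sample timestamps, the code below compares slicing the first 10 characters with parsing the strings via `pd.to_datetime`; the variable names are illustrative only.

```python
import pandas as pd

stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
])

# Option 1: take the first 10 characters of each string.
by_slice = stamps.str[:10]

# Option 2: parse to real timestamps, then format back to a date string.
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')
by_parse = parsed.dt.strftime('%Y-%m-%d')

# Relative frequency of each day, ordered chronologically.
share = by_slice.value_counts(normalize=True).sort_index()
print(share)
```

Slicing is quick when the format is uniform, while parsing additionally validates each value and fails loudly on malformed timestamps.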

Now we describe `registration_year`

In [19]:
autos['registration_year'].describe()

count    49992.000000
mean      2005.074552
std        105.720930
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

One thing that stands out from the summary above is that the `registration_year` column contains some odd values:

* The minimum value is 1000, before cars were invented
* The maximum value is 9999, many years into the future

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
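
The reasoning above suggests deriving the upper bound from the data itself. The following sketch does this on made-up rows: the 1900 lower bound is an assumption, and the column values are invented for illustration.

```python
import pandas as pd

# Made-up listings; every ad was last seen in 2016.
listings = pd.DataFrame({
    'registration_year': [1000, 1997, 2009, 2016, 2017, 9999],
    'last_seen': ['2016-04-06 06:45:54'] * 6,
})

# A car cannot be registered after the listing was last seen,
# so the latest year in last_seen acts as the upper bound.
latest_year = int(pd.to_datetime(listings['last_seen']).dt.year.max())

# between() is inclusive on both ends; 1900 is an assumed lower bound.
valid = listings['registration_year'].between(1900, latest_year)
n_invalid = int((~valid).sum())
print(latest_year, n_invalid)
```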

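Let's count the number of listings with cars that fall outside the 1900 - 2020 interval and see if it's safe to remove those rows entirely, or if we need more custom logic. Note that 2020 is a looser upper bound than the 2016 cutoff argued above, so a few post-2016 registration years will remain in the data.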

In [20]:
print('number of registration years outside of normal range is: ',
      autos[(autos['registration_year'] < 1900) |
            (autos['registration_year'] > 2020)].shape[0]
     )

number of registration years outside of normal range is:  24


Only 24 listings have a `registration_year` outside this range, so it's safe to drop them.

In [21]:
autos = autos[autos['registration_year'].between(1900, 2020)]

In [22]:
autos['registration_year'].describe()

count    49968.000000
mean      2003.367835
std          7.690122
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2019.000000
Name: registration_year, dtype: float64

In [23]:
autos['registration_year'].value_counts(normalize=True)

2000    0.067123
2005    0.060339
1999    0.060018
2004    0.054775
2003    0.054575
2006    0.054195
2001    0.054075
2002    0.050692
1998    0.049091
2007    0.046110
2008    0.044649
2009    0.041987
1997    0.040586
2011    0.032701
2010    0.031960
2017    0.029059
1996    0.028898
2012    0.026477
2016    0.026337
1995    0.026257
2013    0.016130
2014    0.013309
1994    0.013208
2018    0.009826
1993    0.008906
2015    0.007985
1990    0.007905
1992    0.007825
1991    0.007125
1989    0.003622
          ...   
1966    0.000440
1975    0.000380
1969    0.000380
1965    0.000340
1964    0.000240
1963    0.000180
1910    0.000180
1959    0.000140
1961    0.000120
1956    0.000100
1958    0.000080
1937    0.000080
1962    0.000080
1950    0.000060
2019    0.000060
1955    0.000040
1954    0.000040
1951    0.000040
1957    0.000040
1941    0.000040
1934    0.000040
1953    0.000020
1948    0.000020
1927    0.000020
1931    0.000020
1943    0.000020
1929    0.000020
1938    0.0000

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the `brand` column.

In [24]:
autos['brand'].value_counts(normalize=True, ascending=True)

lada              0.000620
lancia            0.001141
rover             0.001381
trabant           0.001541
jaguar            0.001541
daewoo            0.001581
saab              0.001581
land_rover        0.001981
subaru            0.002161
jeep              0.002181
daihatsu          0.002562
dacia             0.002582
chrysler          0.003622
chevrolet         0.005664
suzuki            0.005864
porsche           0.005884
alfa_romeo        0.006584
kia               0.007125
honda             0.007985
mitsubishi        0.008085
mini              0.008485
volvo             0.009126
hyundai           0.009766
sonstige_autos    0.010827
toyota            0.012348
citroen           0.013989
smart             0.014029
nissan            0.015090
mazda             0.015150
skoda             0.015710
seat              0.018812
fiat              0.026157
peugeot           0.029139
renault           0.048111
ford              0.069605
audi              0.085715
mercedes_benz     0.094681
b

In [46]:
top_prices = {}
unique_cars = autos['brand'].unique()
    
for c in unique_cars:
    mean_price = autos.loc[autos['brand'] == c, 'price'].mean()
    top_prices[c] = mean_price

In [51]:
top_prices

{'alfa_romeo': 3943.562310030395,
 'audi': 8965.560354891431,
 'bmw': 8254.43938835667,
 'chevrolet': 6432.929328621908,
 'chrysler': 3286.0552486187844,
 'citroen': 3680.8440629470674,
 'dacia': 5897.736434108527,
 'daewoo': 1038.3544303797469,
 'daihatsu': 1552.09375,
 'fiat': 2697.6771231828616,
 'ford': 3913.0215641173086,
 'honda': 3889.8596491228072,
 'hyundai': 5316.754098360656,
 'jaguar': 11076.506493506493,
 'jeep': 11377.550458715596,
 'kia': 5707.3258426966295,
 'lada': 2476.9032258064517,
 'lancia': 3070.3508771929824,
 'land_rover': 18934.272727272728,
 'mazda': 3962.542932628798,
 'mercedes_benz': 8380.637920101459,
 'mini': 10392.393867924528,
 'mitsubishi': 3328.227722772277,
 'nissan': 4588.879310344828,
 'opel': 2842.8246289169874,
 'peugeot': 3010.8688186813188,
 'porsche': 44537.97959183674,
 'renault': 2351.301996672213,
 'rover': 1494.5217391304348,
 'saab': 3183.493670886076,
 'seat': 4223.654255319149,
 'skoda': 6313.076433121019,
 'smart': 3482.971469329529,
 

Above, we aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.

* Audi, BMW and Mercedes Benz are more expensive
* Ford and Opel are less expensive
* Volkswagen is in between
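
Picking out the most common brands can be done directly from the value counts. This is a sketch on made-up data, using the top 2 instead of the top 6 to keep the example small.

```python
import pandas as pd

# Made-up brand column: 5 volkswagen, 3 bmw, 2 opel, 1 lada.
brands = pd.Series(['volkswagen'] * 5 + ['bmw'] * 3 + ['opel'] * 2 + ['lada'])

# value_counts() sorts from most to least frequent by default.
share = brands.value_counts(normalize=True)
top_brands = share.head(2).index.tolist()
print(top_brands)
```

The resulting list can then be used with `isin()` to restrict a dataframe to those brands before aggregating.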

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:

* it's difficult to compare more than two aggregate series objects if we want to extend to more columns
* we can't compare more than a few rows from each series object
* to make visual comparisons easy, both series objects have to be sorted by their index (brand name), so we can't sort by the values themselves

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly.
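
One possible shortcut, shown here as a sketch and not as the approach taken in this notebook, is to build the combined dataframe in a single step with `groupby` and named aggregation. The rows and the output column names `mean_price` and `mean_km` are made up for illustration.

```python
import pandas as pd

# Made-up listings for three brands.
demo = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi', 'ford'],
    'price': [8000, 9000, 9000, 7000, 3000],
    'odometer_km': [150000, 125000, 150000, 100000, 150000],
})

# One row per brand, one column per aggregate, sharing the brand index.
summary = (
    demo.groupby('brand')
        .agg(mean_price=('price', 'mean'),
             mean_km=('odometer_km', 'mean'))
        .sort_values('mean_price', ascending=False)
)
print(summary)
```

Because both aggregates live in one dataframe, the result can be sorted by either column, which is not possible when comparing two separate series side by side.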

In [52]:
top_miles = {}
unique_cars = autos['brand'].unique()
    
for c in unique_cars:
    mean_miles = autos.loc[autos['brand'] == c, 'odometer_km'].mean()
    top_miles[c] = mean_miles

In [53]:
top_miles

{'alfa_romeo': 131109.4224924012,
 'audi': 129643.9411627364,
 'bmw': 132544.21518054532,
 'chevrolet': 99522.96819787986,
 'chrysler': 133149.17127071825,
 'citroen': 120042.91845493563,
 'dacia': 84728.68217054264,
 'daewoo': 121708.86075949368,
 'daihatsu': 114843.75,
 'fiat': 117012.24177505738,
 'ford': 124153.24899367453,
 'honda': 123709.27318295739,
 'hyundai': 106782.7868852459,
 'jaguar': 121298.7012987013,
 'jeep': 127522.93577981651,
 'kia': 112640.44943820225,
 'lada': 86774.19354838709,
 'lancia': 123157.8947368421,
 'land_rover': 118333.33333333333,
 'mazda': 125132.10039630119,
 'mercedes_benz': 130933.20651025153,
 'mini': 89375.0,
 'mitsubishi': 126893.56435643564,
 'nissan': 118978.7798408488,
 'opel': 129361.37071651091,
 'peugeot': 127352.33516483517,
 'porsche': 97363.94557823129,
 'renault': 128223.79367720465,
 'rover': 136449.27536231885,
 'saab': 143670.88607594935,
 'seat': 122186.17021276595,
 'skoda': 111082.8025477707,
 'smart': 100756.06276747503,
 'sonst

In [54]:
cars_series_price = pd.Series(top_prices)
cars_series_miles = pd.Series(top_miles)

alfa_romeo         3943.562310
audi               8965.560355
bmw                8254.439388
chevrolet          6432.929329
chrysler           3286.055249
citroen            3680.844063
dacia              5897.736434
daewoo             1038.354430
daihatsu           1552.093750
fiat               2697.677123
ford               3913.021564
honda              3889.859649
hyundai            5316.754098
jaguar            11076.506494
jeep              11377.550459
kia                5707.325843
lada               2476.903226
lancia             3070.350877
land_rover        18934.272727
mazda              3962.542933
mercedes_benz      8380.637920
mini              10392.393868
mitsubishi         3328.227723
nissan             4588.879310
opel               2842.824629
peugeot            3010.868819
porsche           44537.979592
renault            2351.301997
rover              1494.521739
saab               3183.493671
seat               4223.654255
skoda              6313.076433
smart   

In [64]:
brand_prices = pd.DataFrame(cars_series_price, columns=['mean_price'])
brand_prices[:5]

Unnamed: 0,mean_price
alfa_romeo,3943.56231
audi,8965.560355
bmw,8254.439388
chevrolet,6432.929329
chrysler,3286.055249


In [65]:
brand_miles = pd.DataFrame(cars_series_miles, columns=['mean_miles'])
brand_miles[:5]

Unnamed: 0,mean_miles
alfa_romeo,131109.422492
audi,129643.941163
bmw,132544.215181
chevrolet,99522.968198
chrysler,133149.171271


In [69]:
brand_prices_miles = pd.concat([brand_prices, brand_miles], axis=1)
brand_prices_miles

Unnamed: 0,mean_price,mean_miles
alfa_romeo,3943.56231,131109.422492
audi,8965.560355,129643.941163
bmw,8254.439388,132544.215181
chevrolet,6432.929329,99522.968198
chrysler,3286.055249,133149.171271
citroen,3680.844063,120042.918455
dacia,5897.736434,84728.682171
daewoo,1038.35443,121708.860759
daihatsu,1552.09375,114843.75
fiat,2697.677123,117012.241775
