# Exploring apartment sale ads in the Saint Petersburg housing market

The data set for this research is the Yandex.Realty archive of ads for the sale of apartments in St. Petersburg and neighboring settlements. 

The main goal of this research is to understand what affects real estate prices. The outcome of the research will be used to build an automated system that tracks anomalies and fraudulent activity.

There are two types of data available for each apartment for sale. The first type is entered by the user; the second type is obtained automatically from map data, for example the distance to the city center, the airport, or the nearest park.

The table contains the following data:

* `airports_nearest` - distance to the nearest airport in meters (m)
* `balcony` - number of balconies
* `ceiling_height` - ceiling height (m)
* `cityCenters_nearest` - distance to the city center (m)
* `days_exposition` - number of days the ad was listed (from publication to removal)
* `first_day_exposition` - date of publication
* `floor` - floor the apartment is on
* `floors_total` - total number of floors in the building
* `is_apartment` - apartments flag (boolean type)
* `kitchen_area` - kitchen area in square meters (m²)
* `last_price` - price at the time the ad was removed from publication
* `living_area` - living area in square meters (m²)
* `locality_name` - name of the locality
* `open_plan` - open-plan layout (boolean type)
* `parks_around3000` - number of parks within a 3 km radius
* `parks_nearest` - distance to the nearest park (m)
* `ponds_around3000` - number of ponds within a 3 km radius
* `ponds_nearest` - distance to the nearest pond (m)
* `rooms` - number of rooms
* `studio` - studio apartment (boolean type)
* `total_area` - total area of the apartment in square meters (m²)
* `total_images` - number of photos of the apartment in the ad

The research includes the following stages:

1. Data investigation.
2. Data cleanup.
3. Data analysis.
4. Follow up.

# Data investigation

The first thing to do is to load the data and check its shape.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
# libraries import
!pip install pymystem3==0.1.10

import pandas as pd
from matplotlib import pyplot as plt
from pymystem3 import Mystem

pd.options.mode.chained_assignment = None  # default='warn'
m = Mystem()




In [10]:
# data import
df = pd.read_csv("/content/drive/MyDrive/da_portfolio/real_estate_data.csv", sep='\t') 
display(df.head(10))
display(df.info())

Unnamed: 0,total_images,last_price,total_area,first_day_exposition,rooms,ceiling_height,floors_total,living_area,floor,is_apartment,studio,open_plan,kitchen_area,balcony,locality_name,airports_nearest,cityCenters_nearest,parks_around3000,parks_nearest,ponds_around3000,ponds_nearest,days_exposition
0,20,13000000.0,108.0,2019-03-07T00:00:00,3,2.7,16.0,51.0,8,,False,False,25.0,,Санкт-Петербург,18863.0,16028.0,1.0,482.0,2.0,755.0,
1,7,3350000.0,40.4,2018-12-04T00:00:00,1,,11.0,18.6,1,,False,False,11.0,2.0,посёлок Шушары,12817.0,18603.0,0.0,,0.0,,81.0
2,10,5196000.0,56.0,2015-08-20T00:00:00,2,,5.0,34.3,4,,False,False,8.3,0.0,Санкт-Петербург,21741.0,13933.0,1.0,90.0,2.0,574.0,558.0
3,0,64900000.0,159.0,2015-07-24T00:00:00,3,,14.0,,9,,False,False,,0.0,Санкт-Петербург,28098.0,6800.0,2.0,84.0,3.0,234.0,424.0
4,2,10000000.0,100.0,2018-06-19T00:00:00,2,3.03,14.0,32.0,13,,False,False,41.0,,Санкт-Петербург,31856.0,8098.0,2.0,112.0,1.0,48.0,121.0
5,10,2890000.0,30.4,2018-09-10T00:00:00,1,,12.0,14.4,5,,False,False,9.1,,городской посёлок Янино-1,,,,,,,55.0
6,6,3700000.0,37.3,2017-11-02T00:00:00,1,,26.0,10.6,6,,False,False,14.4,1.0,посёлок Парголово,52996.0,19143.0,0.0,,0.0,,155.0
7,5,7915000.0,71.6,2019-04-18T00:00:00,2,,24.0,,22,,False,False,18.9,2.0,Санкт-Петербург,23982.0,11634.0,0.0,,0.0,,
8,20,2900000.0,33.16,2018-05-23T00:00:00,1,,27.0,15.43,26,,False,False,8.81,,посёлок Мурино,,,,,,,189.0
9,18,5400000.0,61.0,2017-02-26T00:00:00,3,2.5,9.0,43.6,7,,False,False,6.5,2.0,Санкт-Петербург,50898.0,15008.0,0.0,,0.0,,289.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23699 entries, 0 to 23698
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   total_images          23699 non-null  int64  
 1   last_price            23699 non-null  float64
 2   total_area            23699 non-null  float64
 3   first_day_exposition  23699 non-null  object 
 4   rooms                 23699 non-null  int64  
 5   ceiling_height        14504 non-null  float64
 6   floors_total          23613 non-null  float64
 7   living_area           21796 non-null  float64
 8   floor                 23699 non-null  int64  
 9   is_apartment          2775 non-null   object 
 10  studio                23699 non-null  bool   
 11  open_plan             23699 non-null  bool   
 12  kitchen_area          21421 non-null  float64
 13  balcony               12180 non-null  float64
 14  locality_name         23650 non-null  object 
 15  airports_nearest   

None

Now the data is loaded, and I can see that not all of it can be used for analysis as it is: the non-null counts differ between columns, which means some cells hold `NaN` instead of a value. The data also needs to be checked for duplicates.
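The same observation can be made programmatically instead of by reading the `df.info()` printout. Below is a minimal sketch on a toy frame (the values are made up for illustration; only the column names mirror the real table) that reads the shape and lists the columns whose non-null count is lower than the number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration; it is not a sample of the real data.
toy = pd.DataFrame({
    "total_area": [108.0, 40.4, 56.0],
    "ceiling_height": [2.7, np.nan, np.nan],
    "balcony": [np.nan, 2.0, 0.0],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
non_null = toy.count()       # non-null values per column

# Columns with fewer non-null values than rows contain missing data.
incomplete = non_null[non_null < n_rows].index.tolist()

print("Shape:", toy.shape)
print("Columns with missing values:", incomplete)
```

On the real frame the same three lines would work with `df` in place of `toy`.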





# Data cleanup

I will start with checking for duplicates.

In [13]:
# converting locality names to lowercase
df["locality_name"] = df["locality_name"].str.lower()

# checking for duplicates in the whole dataframe
print("The number of duplicates is:", df.duplicated().sum())

The number of duplicates is: 0
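One caveat: `duplicated()` only flags rows that are identical in every column, so rows that differ only in spelling are not counted. The sketch below uses a toy frame (made-up rows, not taken from the data set) to show the difference. Replacing "ё" with "е" is an assumption about how locality names might vary, not something verified on this data.

```python
import pandas as pd

# Toy data, made up for illustration; not taken from the real data set.
toy = pd.DataFrame({
    "locality_name": ["посёлок Мурино", "поселок мурино", "посёлок Мурино"],
    "total_area": [33.0, 33.0, 33.0],
})

# Exact full-row duplicates: only the third row repeats the first one.
exact_duplicates = toy.duplicated().sum()

# Hypothetical normalisation: lowercase and "ё" -> "е".
normalized = toy.assign(
    locality_name=toy["locality_name"]
    .str.lower()
    .str.replace("ё", "е", regex=False)
)
implicit_duplicates = normalized.duplicated().sum()

print("Exact duplicates:", exact_duplicates)
print("Duplicates after normalisation:", implicit_duplicates)
```

Lowercasing `locality_name` before calling `duplicated()`, as done above, covers the first half of this normalisation.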


There are no full-row duplicates, so I can move on with the cleanup. The next step is to decide what to do with the columns that have fewer non-null values than the table has rows.

In general, a lower non-null count means that some cells in the column are not filled with proper data. I will count such cells per column and also check that the column names are fine.

In [14]:
# checking the columns' names
df.columns

Index(['total_images', 'last_price', 'total_area', 'first_day_exposition',
       'rooms', 'ceiling_height', 'floors_total', 'living_area', 'floor',
       'is_apartment', 'studio', 'open_plan', 'kitchen_area', 'balcony',
       'locality_name', 'airports_nearest', 'cityCenters_nearest',
       'parks_around3000', 'parks_nearest', 'ponds_around3000',
       'ponds_nearest', 'days_exposition'],
      dtype='object')

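No spaces in the names. One column, `cityCenters_nearest`, mixes camelCase into the otherwise snake_case names, but this does not get in the way of the analysis, so I keep the names as they are.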
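A name check like this can also be automated rather than done by eye. The sketch below is one possible approach: `non_snake_case` is a hypothetical helper (not part of pandas) that returns every name that is not plain lowercase snake_case. Run on the column names printed above, it flags `cityCenters_nearest`.

```python
import re

# Column names as printed by df.columns above.
columns = [
    "total_images", "last_price", "total_area", "first_day_exposition",
    "rooms", "ceiling_height", "floors_total", "living_area", "floor",
    "is_apartment", "studio", "open_plan", "kitchen_area", "balcony",
    "locality_name", "airports_nearest", "cityCenters_nearest",
    "parks_around3000", "parks_nearest", "ponds_around3000",
    "ponds_nearest", "days_exposition",
]

# Lowercase letters and digits, with words separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def non_snake_case(names):
    """Return the names that are not lowercase snake_case (hypothetical helper)."""
    return [name for name in names if not SNAKE_CASE.fullmatch(name)]


flagged = non_snake_case(columns)
print(flagged)  # ['cityCenters_nearest']
```

With a loaded frame, `non_snake_case(df.columns)` would do the same check directly.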

In [15]:
# counting the null values in each column
df.isnull().sum()

total_images                0
last_price                  0
total_area                  0
first_day_exposition        0
rooms                       0
ceiling_height           9195
floors_total               86
living_area              1903
floor                       0
is_apartment            20924
studio                      0
open_plan                   0
kitchen_area             2278
balcony                 11519
locality_name              49
airports_nearest         5542
cityCenters_nearest      5519
parks_around3000         5518
parks_nearest           15620
ponds_around3000         5518
ponds_nearest           14589
days_exposition          3181
dtype: int64
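Raw counts are easier to compare as a share of all rows. The sketch below reuses the non-zero counts printed above and the total of 23699 rows reported by `df.info()`; on the loaded frame, `df.isnull().mean() * 100` should give the same percentages without copying numbers by hand.

```python
import pandas as pd

TOTAL_ROWS = 23699  # from the RangeIndex line in the df.info() output

# Non-zero missing-value counts copied from the df.isnull().sum() output.
missing = pd.Series({
    "ceiling_height": 9195,
    "floors_total": 86,
    "living_area": 1903,
    "is_apartment": 20924,
    "kitchen_area": 2278,
    "balcony": 11519,
    "locality_name": 49,
    "airports_nearest": 5542,
    "cityCenters_nearest": 5519,
    "parks_around3000": 5518,
    "parks_nearest": 15620,
    "ponds_around3000": 5518,
    "ponds_nearest": 14589,
    "days_exposition": 3181,
})

# Share of missing values per column, in percent, largest first.
missing_share = (missing / TOTAL_ROWS * 100).round(1).sort_values(ascending=False)
print(missing_share)
```

Sorting by share makes it easier to see which columns are mostly empty and which are missing only a handful of values.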