Week 3

Exercise:

Join kaggle.com and choose a real-world dataset. (This is part of your homework; try to do some research on Kaggle.)

Download the data of your choice.

Practice data cleaning and preprocessing: handle missing values, outliers, etc.

Explore the basic statistics with pandas methods.

Create visualizations to understand the distribution of variables.

Identify correlations between variables using correlation matrices and/or heatmaps.

Derive insights from your analysis. What interesting patterns or trends did you discover?

Notes:

The goal is to gain insights into the data and present your findings through meaningful visualizations.

Document your analysis and include code comments to explain each step of the analysis.

Create visualizations with clear labels and titles.

Summarize your findings in an informative manner.

In [1]:
# Import the libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



# Gathering Data

In [2]:
dataset_url = "C:/Users/avcil/projects/data-science/homeworks/data/housing_price_dataset.csv"

In [3]:
housing_price_dataset = pd.read_csv(dataset_url)

In [11]:
housing_price_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SquareFeet    50000 non-null  int64  
 1   Bedrooms      50000 non-null  int64  
 2   Bathrooms     50000 non-null  int64  
 3   Neighborhood  50000 non-null  object 
 4   YearBuilt     50000 non-null  int64  
 5   Price         50000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 2.3+ MB


In [21]:
housing_price_dataset.head(10)

Unnamed: 0,SquareFeet,Bedrooms,Bathrooms,Neighborhood,YearBuilt,Price
0,2126,4,1,Rural,1969,215355.283618
1,2459,3,2,Rural,1980,195014.221626
2,1860,2,1,Suburb,1970,306891.012076
3,2294,2,1,Urban,1996,206786.787153
4,2130,5,2,Suburb,2001,272436.239065
5,2095,2,3,Suburb,2020,198208.803907
6,2724,2,1,Suburb,1993,343429.31911
7,2044,4,3,Rural,1957,184992.321268
8,2638,4,3,Urban,1959,377998.588152
9,1121,5,2,Urban,2004,95961.926014


We inspected the data types and saw that only the Neighborhood column is of object type.
urban = living in a city
suburb = living on the outskirts of a city
rural = living in a village or in the countryside
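Since Neighborhood is the only non-numeric column, it can help to list its categories and their frequencies. A minimal sketch, assuming a small stand-in frame named `sample` (hypothetical) that holds only the Neighborhood values of the ten rows shown in the head(10) output:

```python
import pandas as pd

# Hypothetical stand-in for housing_price_dataset: the Neighborhood values
# of the ten rows shown in the head(10) output.
sample = pd.DataFrame({
    "Neighborhood": ["Rural", "Rural", "Suburb", "Urban", "Suburb",
                     "Suburb", "Suburb", "Rural", "Urban", "Urban"],
})

# Distinct categories in the column, sorted alphabetically.
categories = sorted(sample["Neighborhood"].unique())

# Number of rows per category.
counts = sample["Neighborhood"].value_counts()

print(categories)
print(counts)
```

On the full dataset the same two calls would be applied to `housing_price_dataset["Neighborhood"]`.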

In [12]:
print(f'Shape     : {housing_price_dataset.shape}\n'
      f'Size      : {housing_price_dataset.size}\n'
      f'Dimension : {housing_price_dataset.ndim}')


Shape     : (50000, 6)
Size      : 300000
Dimension : 2


In [13]:
# Check for missing values
housing_price_dataset.isnull().sum()

SquareFeet      0
Bedrooms        0
Bathrooms       0
Neighborhood    0
YearBuilt       0
Price           0
dtype: int64

In [15]:
housing_price_dataset.duplicated().sum()
# Check for duplicate rows

0

In [16]:
# List the column names
housing_price_dataset.columns


Index(['SquareFeet', 'Bedrooms', 'Bathrooms', 'Neighborhood', 'YearBuilt',
       'Price'],
      dtype='object')

# Exploratory Data Analysis & Preprocessing The Data

In [17]:
# A table of basic statistics computed for the numeric columns
# After the transpose, each column of the table holds one statistic (count, mean, std, ...)
# A statistical summary like this helps in understanding the data during analysis and exploration
housing_price_dataset.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SquareFeet,50000.0,2006.37468,575.513241,1000.0,1513.0,2007.0,2506.0,2999.0
Bedrooms,50000.0,3.4987,1.116326,2.0,3.0,3.0,4.0,5.0
Bathrooms,50000.0,1.99542,0.815851,1.0,1.0,2.0,3.0,3.0
YearBuilt,50000.0,1985.40442,20.719377,1950.0,1967.0,1985.0,2003.0,2021.0
Price,50000.0,224827.325151,76141.842966,-36588.165397,169955.860225,225052.141166,279373.630052,492195.259972


In [24]:
# Price, YearBuilt and SquareFeet cannot be 0 or negative. Convert such values to NaN
housing_price_dataset.loc[housing_price_dataset["Price"] <= 0 , "Price"] = np.nan
housing_price_dataset.loc[housing_price_dataset["SquareFeet"] <= 0 , "SquareFeet"] = np.nan
housing_price_dataset.loc[housing_price_dataset["YearBuilt"] <= 0 , "YearBuilt"] = np.nan
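The replacement above leaves NaN values in Price (describe() reports a negative minimum), so those rows still have to be handled. A minimal sketch of two common options, assuming a small hypothetical frame `sample` with illustrative values rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with illustrative values; one price is negative.
sample = pd.DataFrame({
    "SquareFeet": [2126, 2459, 1000, 1860],
    "Price": [215355.28, 195014.22, -36588.17, 306891.01],
})

# Same replacement as in the cell above: non-positive prices become NaN.
sample.loc[sample["Price"] <= 0, "Price"] = np.nan

# Count how many prices were invalidated.
n_missing = int(sample["Price"].isna().sum())

# Option A: drop the rows whose price is missing.
dropped = sample.dropna(subset=["Price"])

# Option B: fill the missing prices with the median of the valid prices.
filled = sample.fillna({"Price": sample["Price"].median()})

print(n_missing)
print(dropped)
print(filled)
```

Which option is appropriate depends on how many rows are affected; dropping is usually harmless when only a handful of rows out of 50000 are invalid.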

In [26]:
# We can add house age instead of using year built
housing_price_dataset["HouseAge"] = 2023 - housing_price_dataset["YearBuilt"]


In [27]:
#check the dataset
housing_price_dataset

Unnamed: 0,SquareFeet,Bedrooms,Bathrooms,Neighborhood,YearBuilt,Price,HouseAge
0,2126,4,1,Rural,1969,215355.283618,54
1,2459,3,2,Rural,1980,195014.221626,43
2,1860,2,1,Suburb,1970,306891.012076,53
3,2294,2,1,Urban,1996,206786.787153,27
4,2130,5,2,Suburb,2001,272436.239065,22
...,...,...,...,...,...,...,...
49995,1282,5,3,Rural,1975,100080.865895,48
49996,2854,2,2,Suburb,1988,374507.656727,35
49997,2979,5,3,Suburb,1962,384110.555590,61
49998,2596,5,2,Rural,1984,380512.685957,39


In [29]:
# And delete the year built
housing_price_dataset = housing_price_dataset.drop(["YearBuilt"], axis = 1)

In [30]:
#check again
housing_price_dataset.head()

Unnamed: 0,SquareFeet,Bedrooms,Bathrooms,Neighborhood,Price,HouseAge
0,2126,4,1,Rural,215355.283618,54
1,2459,3,2,Rural,195014.221626,43
2,1860,2,1,Suburb,306891.012076,53
3,2294,2,1,Urban,206786.787153,27
4,2130,5,2,Suburb,272436.239065,22


# Data Visualization

To do: fix from here on.



In [None]:
# Subplots of numeric features v Price
sns.set_style('darkgrid')
f, axes = plt.subplots(2, 2, figsize = (20,15))

# Plot [0,0]
axes[0,0].scatter(x = 'SquareFeet', y = 'Price', data = housing_price_dataset, edgecolor = 'b')
axes[0,0].set_xlabel('SquareFeet')
axes[0,0].set_ylabel('Price')
axes[0,0].set_title('SquareFeet v Price')

# Plot [0,1]
axes[0,1].scatter(x = 'Bedrooms', y = 'Price', data = housing_price_dataset, edgecolor = 'b')
axes[0,1].set_xlabel('Bedrooms')
axes[0,1].set_ylabel('Price')
axes[0,1].set_title('Bedrooms v Price')

# Plot [1,0]
axes[1,0].scatter(x = 'Bathrooms', y = 'Price', data = housing_price_dataset, edgecolor = 'b')
axes[1,0].set_xlabel('Bathrooms')
axes[1,0].set_ylabel('Price')
axes[1,0].set_title('Bathrooms v Price')

# Plot [1,1]
axes[1,1].scatter(x = 'HouseAge', y = 'Price', data = housing_price_dataset, edgecolor = 'b')
axes[1,1].set_xlabel('HouseAge')
axes[1,1].set_ylabel('Price')
axes[1,1].set_title('HouseAge v Price')

plt.show()

In [None]:
# Pairwise relationships between all numeric columns
sns.pairplot(housing_price_dataset)
plt.show()

In [None]:
# Plot a histogram of each numerical attribute
housing_price_dataset.hist(figsize=(15, 10))
plt.show()

In [None]:
# Correlation matrix
# Correlation inspection (numeric columns only, because Neighborhood is of object type)
plt.figure(figsize=(16,10))
sns.heatmap(housing_price_dataset.corr(numeric_only=True), annot=True)
plt.title('Correlation')
plt.show()

# Another way
# Correlation matrix
f, ax = plt.subplots(figsize=(5, 5))
corrmat = housing_price_dataset.corr(numeric_only=True)
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()
corrmat
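Besides the heatmap, the correlation of each feature with Price can be read off as a sorted list. A minimal sketch, assuming pandas 1.5 or newer (for the `numeric_only` argument) and a hypothetical frame `sample` built from the ten rows of the head(10) output:

```python
import pandas as pd

# Hypothetical stand-in for housing_price_dataset: values copied from the
# ten rows shown in the head(10) output.
sample = pd.DataFrame({
    "SquareFeet": [2126, 2459, 1860, 2294, 2130, 2095, 2724, 2044, 2638, 1121],
    "Bedrooms": [4, 3, 2, 2, 5, 2, 2, 4, 4, 5],
    "Neighborhood": ["Rural", "Rural", "Suburb", "Urban", "Suburb",
                     "Suburb", "Suburb", "Rural", "Urban", "Urban"],
    "Price": [215355.283618, 195014.221626, 306891.012076, 206786.787153,
              272436.239065, 198208.803907, 343429.31911, 184992.321268,
              377998.588152, 95961.926014],
})

# Correlation of every numeric feature with Price, strongest first.
# numeric_only=True skips the object-typed Neighborhood column.
corr_with_price = (
    sample.corr(numeric_only=True)["Price"]
    .drop("Price")
    .sort_values(key=abs, ascending=False)
)

print(corr_with_price)
```

Ten rows are far too few to draw conclusions from; the point is only the shape of the call, which would be run on the full `housing_price_dataset`.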

In [None]:
# Outlier detection
sns.boxplot(x=housing_price_dataset["SquareFeet"])
plt.show()

sns.boxplot(x=housing_price_dataset["HouseAge"])
plt.show()

# Price boxplot
plt.figure(figsize=(8,8))
sns.boxplot(y="Price", data=housing_price_dataset)
plt.show()

# SquareFeet boxplot
plt.figure(figsize=(8,8))
sns.boxplot(y="SquareFeet", data=housing_price_dataset)
plt.show()

# HouseAge boxplot
plt.figure(figsize=(8,8))
sns.boxplot(y="HouseAge", data=housing_price_dataset)
plt.show()
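A boxplot draws its whiskers at 1.5 times the interquartile range by default, and the same rule can be applied numerically to count the outliers. A minimal sketch; the helper `iqr_outliers` and the toy values are hypothetical and not part of the dataset:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Illustrative values: five similar prices and one extreme value.
prices = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 1000.0])

mask = iqr_outliers(prices)
print(prices[mask])        # the values flagged as outliers
print(int(mask.sum()))     # how many were flagged
```

On the real data the helper would be called per column, for example `iqr_outliers(housing_price_dataset["Price"].dropna())`.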

In [None]:

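# Drop missing prices first; plt.hist cannot compute bins when NaN is present
prices = housing_price_dataset['Price'].dropna()

# Histogram
plt.hist(prices, bins=30, color='skyblue', edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()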


In [None]:

# Select all numeric variables (excluding the Price column)
num_cols = housing_price_dataset.select_dtypes(include=['number']).drop(columns=['Price'])

# Plot the relationship between each numeric variable and Price
sns.set(style='ticks')
sns.pairplot(data=housing_price_dataset, x_vars=num_cols.columns, y_vars=['Price'])
plt.show()



* make a pie chart for the categorical variable
* you can also examine the data by grouping it
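The two notes above could be sketched as follows. This is only an illustration, assuming a hypothetical frame `sample` built from the ten rows of the head(10) output; on the full data the same calls would use `housing_price_dataset`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for housing_price_dataset: values copied from the
# ten rows shown in the head(10) output.
sample = pd.DataFrame({
    "Neighborhood": ["Rural", "Rural", "Suburb", "Urban", "Suburb",
                     "Suburb", "Suburb", "Rural", "Urban", "Urban"],
    "Price": [215355.283618, 195014.221626, 306891.012076, 206786.787153,
              272436.239065, 198208.803907, 343429.31911, 184992.321268,
              377998.588152, 95961.926014],
})

# Pie chart: share of houses in each neighborhood.
counts = sample["Neighborhood"].value_counts()
fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(counts.to_numpy(), labels=counts.index.tolist(), autopct="%1.1f%%")
ax.set_title("Share of houses by neighborhood")
plt.show()

# Grouped view: number of houses and typical price per neighborhood.
price_by_area = sample.groupby("Neighborhood")["Price"].agg(["count", "mean", "median"])
print(price_by_area)
```

A bar chart of `price_by_area["mean"]` would be a natural follow-up if the grouped prices differ noticeably.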