## Statistical examination of the data and removal of outliers

Loading the required packages and modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading the data

In [2]:
df = pd.read_csv('data.csv')

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8136 entries, 0 to 8135
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   city          8136 non-null   object
 1   district      8136 non-null   object
 2   neighborhood  8136 non-null   object
 3   room          8136 non-null   int64 
 4   living_room   8136 non-null   int64 
 5   area          8136 non-null   int64 
 6   age           8136 non-null   int64 
 7   floor         8136 non-null   int64 
 8   price         8136 non-null   int64 
dtypes: int64(6), object(3)
memory usage: 572.2+ KB
None


In [4]:
# Convert the text columns to the category dtype, which stores each distinct value once
for col in ['city', 'district', 'neighborhood']:
    df[col] = df[col].astype('category')
# The numeric columns are already int64 (see the info() output above), so no cast is needed

In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8136 entries, 0 to 8135
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   city          8136 non-null   category
 1   district      8136 non-null   category
 2   neighborhood  8136 non-null   category
 3   room          8136 non-null   int64   
 4   living_room   8136 non-null   int64   
 5   area          8136 non-null   int64   
 6   age           8136 non-null   int64   
 7   floor         8136 non-null   int64   
 8   price         8136 non-null   int64   
dtypes: category(3), int64(6)
memory usage: 438.2 KB
None
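
The two `info()` outputs report 572.2+ KB before the conversion and 438.2 KB after it. The `+` in the first figure means pandas did not count the contents of the string columns, so the real saving is larger than the two numbers suggest. A minimal sketch of how the difference could be measured, using a hypothetical toy column rather than the real `data.csv`:

```python
import pandas as pd

# Hypothetical toy column: 3000 rows but only three distinct city names
cities = pd.Series(["Ankara", "Izmir", "Bursa"] * 1000)

# deep=True asks pandas to count the string contents as well
as_text = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

print(f"text: {as_text} bytes, category: {as_category} bytes")
```

The saving depends on how often values repeat; a column of mostly unique strings would gain little from the conversion.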


Computing the quartiles of the numeric variables and the IQR-based minimum and maximum bounds

In [6]:
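# Select the numeric columns and compute the IQR-based bounds for each one
columns = df.select_dtypes(include=[np.number]).columns
min_values = []
max_values = []
for column in columns:
    Q1 = df[column].quantile(0.25)  # first quartile
    Q3 = df[column].quantile(0.75)  # third quartile
    IQR = Q3 - Q1                   # interquartile range
    # Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers
    min_value = Q1 - 1.5 * IQR
    max_value = Q3 + 1.5 * IQR
    min_values.append(min_value)
    max_values.append(max_value)
    print(f"Column: {column}, min: {min_value}, max: {max_value}")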

Column: room, min: 0.5, max: 4.5
Column: living_room, min: 1.0, max: 1.0
Column: area, min: -17.5, max: 242.5
Column: age, min: -20.0, max: 44.0
Column: floor, min: -2.0, max: 6.0
Column: price, min: -18000.0, max: 62000.0
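
The same bounds can be computed for all columns at once, because `DataFrame.quantile` returns one value per column. This is a sketch on hypothetical toy data, not the notebook's own method; the names `toy` and `bounds` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for two numeric columns of df
toy = pd.DataFrame({
    "area": [60, 75, 100, 130, 900],
    "price": [9000, 11000, 15000, 21000, 250000],
})

q1 = toy.quantile(0.25)  # Series indexed by column name
q3 = toy.quantile(0.75)
iqr = q3 - q1

# One row per column, holding the lower and upper outlier bounds
bounds = pd.DataFrame({"lower": q1 - 1.5 * iqr, "upper": q3 + 1.5 * iqr})
print(bounds)
```

For the toy `area` column the quartiles are 75 and 130, so the bounds are -7.5 and 212.5, and the value 900 falls outside them.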


Removing the outliers

In [13]:
# Keep only the rows that lie within the bounds of every numeric column
for i, column in enumerate(columns):
    df = df[(df[column] >= min_values[i]) & (df[column] <= max_values[i])]

In [14]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 6212 entries, 0 to 8134
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   city          6212 non-null   category
 1   district      6212 non-null   category
 2   neighborhood  6212 non-null   category
 3   room          6212 non-null   int64   
 4   living_room   6212 non-null   int64   
 5   area          6212 non-null   int64   
 6   age           6212 non-null   int64   
 7   floor         6212 non-null   int64   
 8   price         6212 non-null   int64   
dtypes: category(3), int64(6)
memory usage: 388.9 KB
None
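
The filter reduces the data from 8136 to 6212 rows. Part of that comes from `living_room`: its printed bounds are both 1.0, which means its IQR is zero and every row with a value other than 1 is dropped. The sketch below, on hypothetical toy data, shows one way to count how many values each column would flag before the rows are removed.

```python
import pandas as pd

# Hypothetical toy data; living_room is almost always 1, as in the real data
toy = pd.DataFrame({
    "living_room": [1, 1, 1, 1, 1, 1, 1, 2],
    "area": [60, 70, 80, 90, 100, 110, 120, 130],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean frame: True where a value falls outside its own column's bounds
outside = toy.lt(lower, axis="columns") | toy.gt(upper, axis="columns")
removed_per_column = outside.sum()
print(removed_per_column)
```

Here `living_room` has a zero IQR, so the single row with the value 2 is flagged even though it is not an implausible value. Whether such a column should be excluded from the IQR filter is a judgment call that depends on the data.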


In [15]:
print(df.describe())

              room  living_room         area          age        floor  \
count  6212.000000       6212.0  6212.000000  6212.000000  6212.000000   
mean      2.176272          1.0   104.669350    12.653896     2.199614   
std       0.826815          0.0    39.442494    10.451565     1.589618   
min       1.000000          1.0     5.000000     0.000000    -2.000000   
25%       2.000000          1.0    75.000000     4.000000     1.000000   
50%       2.000000          1.0   100.000000    10.000000     2.000000   
75%       3.000000          1.0   130.000000    20.000000     3.000000   
max       4.000000          1.0   240.000000    44.000000     6.000000   

              price  
count   6212.000000  
mean   17900.975853  
std    10467.582893  
min        1.000000  
25%    11000.000000  
50%    15000.000000  
75%    21000.000000  
max    60000.000000  


Manual correction of the rental price

In [16]:
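# The IQR lower bound for price is negative, so implausibly low prices (min is 1) survive it;
# drop listings priced below 3000 by hand
df = df[df['price'] >= 3000]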

In [17]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 6116 entries, 23 to 8134
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   city          6116 non-null   category
 1   district      6116 non-null   category
 2   neighborhood  6116 non-null   category
 3   room          6116 non-null   int64   
 4   living_room   6116 non-null   int64   
 5   area          6116 non-null   int64   
 6   age           6116 non-null   int64   
 7   floor         6116 non-null   int64   
 8   price         6116 non-null   int64   
dtypes: category(3), int64(6)
memory usage: 383.3 KB
None
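
The row count falls from 6212 to 6116, so the manual threshold removes 96 listings. Before applying a hand-picked cutoff like this, it can help to count the affected rows first. A minimal sketch on hypothetical toy prices, where `threshold` mirrors the 3000 used above:

```python
import pandas as pd

# Hypothetical toy prices, including two implausibly low values
toy = pd.DataFrame({"price": [1, 500, 3000, 11000, 15000]})

threshold = 3000                   # same cutoff as in the cell above
below = toy["price"] < threshold   # True for rows that would be dropped
n_below = int(below.sum())

cleaned = toy[~below]
print(f"dropping {n_below} of {len(toy)} rows")
```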


In [19]:
print(df.describe())

              room  living_room         area          age        floor  \
count  6116.000000       6116.0  6116.000000  6116.000000  6116.000000   
mean      2.180020          1.0   104.830445    12.698169     2.198496   
std       0.826463          0.0    39.467687    10.465384     1.589161   
min       1.000000          1.0     5.000000     0.000000    -2.000000   
25%       2.000000          1.0    75.000000     4.000000     1.000000   
50%       2.000000          1.0   100.000000    10.000000     2.000000   
75%       3.000000          1.0   130.000000    20.000000     3.000000   
max       4.000000          1.0   240.000000    44.000000     6.000000   

              price  
count   6116.000000  
mean   18170.733976  
std    10323.229150  
min     3000.000000  
25%    11500.000000  
50%    15000.000000  
75%    21000.000000  
max    60000.000000  


In [20]:
df.to_csv('data_cleaned.csv', index=False)
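
A CSV file stores only text, so the `category` dtype set earlier is not preserved in `data_cleaned.csv`. When the file is read back, the dtype can be requested again through the `dtype` argument of `pd.read_csv`. The sketch below uses an in-memory buffer and hypothetical toy values instead of the real file.

```python
import io
import pandas as pd

# Hypothetical toy frame with one category column
toy = pd.DataFrame({
    "city": pd.Series(["Ankara", "Izmir", "Ankara"], dtype="category"),
    "price": [11000, 15000, 21000],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # city comes back as plain text, not category

buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"city": "category"})  # category restored

print(plain.dtypes)
print(typed.dtypes)
```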