### Loading the Diet_R.csv dataset

In [18]:
import pandas as pd

# Treat empty strings, blanks, "?", "NA" and "NaN" as missing values
data = pd.read_csv("data/Diet_R.csv", na_values=["", " ", "?", "NA", "NaN"])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Person        78 non-null     int64  
 1   gender        76 non-null     float64
 2   Age           78 non-null     int64  
 3   Height        78 non-null     int64  
 4   pre.weight    78 non-null     int64  
 5   Diet          78 non-null     int64  
 6   weight6weeks  78 non-null     float64
dtypes: float64(2), int64(5)
memory usage: 4.4 KB


### Detecting outliers with the interquartile range (IQR) method

In [19]:
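def detect_outliers(df):
    """Print how many values in each numeric column lie outside the 1.5 * IQR fences."""
    outlier_counts = {}
    # "number" selects every numeric dtype regardless of platform integer width
    for col in df.select_dtypes(include="number").columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        # Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR count as outliers
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outlier_counts[col] = outliers.shape[0]
    print("Outlier counts per column:")
    print(outlier_counts)

detect_outliers(data)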

Outlier counts per column:
{'Person': 0, 'gender': 0, 'Age': 0, 'Height': 8, 'pre.weight': 1, 'Diet': 0, 'weight6weeks': 1}
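`detect_outliers` reports only counts. To see which values fall outside the fences, the same rule can be applied to a single column. The sketch below is illustrative only: `iqr_bounds` is a hypothetical helper, and the heights are made-up toy values, not rows from Diet_R.csv.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values (not taken from Diet_R.csv): nine typical heights and one extreme
heights = pd.Series([160, 162, 165, 168, 170, 171, 173, 175, 178, 210], name="Height")

lower, upper = iqr_bounds(heights)
# Keep only the observations outside the fences
flagged = heights[(heights < lower) | (heights > upper)]

print(f"fences: [{lower}, {upper}]")
print(flagged)
```

Here only the value 210 lies outside the fences. Whether a flagged value is an error or a genuine extreme observation still has to be judged from the data.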


### Detecting missing values

In [20]:
missing_values = data.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 Person          0
gender          2
Age             0
Height          0
pre.weight      0
Diet            0
weight6weeks    0
dtype: int64
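Besides the raw counts, it can help to look at the share of missing values and at the affected rows themselves. A minimal sketch on a made-up frame (the values are hypothetical and only mimic a gap in a `gender` column):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in one column
toy = pd.DataFrame({
    "Person": [1, 2, 3, 4],
    "gender": [0.0, np.nan, 1.0, np.nan],
    "Age": [30, 41, 25, 52],
})

# Percentage of missing values per column
missing_share = toy.isnull().mean() * 100
# Rows that contain at least one missing value
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```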


### Replacing missing values with the column mean

In [21]:
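from sklearn.impute import SimpleImputer

# Fill missing values with the column mean; for the 0/1-coded gender column
# this inserts a fractional value rather than a valid category code
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning
data['gender'] = imputer.fit_transform(data[['gender']]).ravel()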
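Mean imputation on a 0/1-coded column inserts a fractional value that is not a valid category code. One possible alternative, not used in this notebook, is to fill with the most frequent value. The sketch below compares the two on made-up data using plain pandas; it assumes the column is coded 0/1, as the statistics for `gender` suggest.

```python
import numpy as np
import pandas as pd

# Toy 0/1-coded column with gaps (hypothetical values, not from Diet_R.csv)
toy = pd.DataFrame({"gender": [0.0, 1.0, 0.0, np.nan, 0.0, 1.0, np.nan]})

# Mean imputation: gaps receive the average of the observed codes
mean_filled = toy["gender"].fillna(toy["gender"].mean())
# Mode imputation: gaps receive the most frequent observed code
mode_filled = toy["gender"].fillna(toy["gender"].mode().iloc[0])

print(sorted(mean_filled.unique()))  # includes a fractional value
print(sorted(mode_filled.unique()))  # stays within the original codes
```

With scikit-learn, `SimpleImputer(strategy='most_frequent')` gives the same kind of fill. Which strategy is appropriate depends on how the column is used later.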


For the loaded dataset, compute basic statistics:
* mean, median, standard deviation, first and third quartiles
* compute the statistics for the dataset as a whole and broken down by gender
* save the results to a file

In [22]:
avg = data.mean()
median = data.median()
std_dev = data.std()
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)

stats = pd.DataFrame({
    'Mean': avg,
    'Median': median,
    'StdDev': std_dev,
    'Q1': q1,
    'Q3': q3
})

stats.to_csv('data/wyniki.csv')
stats

| | Mean | Median | StdDev | Q1 | Q3 |
|---|---|---|---|---|---|
| Person | 39.5 | 39.5 | 22.660538 | 20.25 | 58.75 |
| gender | 0.434211 | 0.0 | 0.492424 | 0.0 | 1.0 |
| Age | 39.153846 | 39.0 | 9.815277 | 32.25 | 46.75 |
| Height | 170.820513 | 169.5 | 11.276621 | 164.25 | 174.75 |
| pre.weight | 72.525641 | 72.0 | 8.723344 | 66.0 | 78.0 |
| Diet | 2.038462 | 2.0 | 0.81292 | 1.0 | 3.0 |
| weight6weeks | 68.680769 | 68.95 | 8.924504 | 61.85 | 73.825 |
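The cell above covers the dataset as a whole. The breakdown by gender from the task could be sketched with `groupby` and `agg`. This example uses made-up toy data and assumes `gender` is coded 0/1; the helpers `q1` and `q3` are hypothetical names introduced here. On the real data, mean-imputed rows would form a third group with a fractional gender value, so it may be preferable to group before imputing (by default `groupby` drops rows whose key is missing).

```python
import pandas as pd

# Toy data (hypothetical values); gender is assumed to be coded 0/1
toy = pd.DataFrame({
    "gender": [0, 0, 0, 1, 1, 1],
    "pre.weight": [60, 62, 64, 80, 84, 88],
    "weight6weeks": [58.0, 59.0, 63.0, 76.0, 80.0, 87.0],
})

def q1(s):
    """First quartile of a series."""
    return s.quantile(0.25)

def q3(s):
    """Third quartile of a series."""
    return s.quantile(0.75)

# One row per gender; columns form a (variable, statistic) MultiIndex
by_gender = toy.groupby("gender").agg(["mean", "median", "std", q1, q3])
print(by_gender)
```

The resulting frame can be written to a file with `to_csv`, in the same way as the overall table.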
