

### TASKS
1. Data cleaning and preparation
2. Explorative analysis
    - Visuals -> hist, plot, heatmap maybe?
    - Descriptive statistics -> mean, median, std deviation, freq counts
    - Figuring out which measures are important
    - Recognizing patterns and special groups etc.
    - Few hypotheses -> relations, group behaviour etc.
3. Statistical Analysis
    - Do testing -> T-test, chi-square, ANOVA etc.
    - Confidence intervals and estimation of the parameter(s) that best represent the population
    - Regression and modeling
    - Hypotheses and their testing
4. Must do a statistical model

### Reviewing criteria
1. Data preparation DONE
2. Use of descriptive statistics -> Working ON
3. Use of estimation and statistical test -> NEXT UP
4. Argumentation for design choices -> Working ON 
5. Interpretation of results

### Tasks from the hypothetical scheme
1. Characterise the individuals that are present in the data. Are there groups of similar persons?
    - Distributions for all categorical values DONE
    - Find out if certain age, sex, and municipality groups have similar activities ALMOST DONE
2. Estimate how much time on average households spend daily on each activity.
    - try out mean and medians for activities
3. With respect to which activities do men and women differ?
4. With respect to which activities do living environments differ?
5. Which activities are associated with each other?

## Use of AI

1. Explanatory summaries of topics where I did not fully understand the meaning of a certain statistical test and the lecture slides did not answer my questions.
2. I have GitHub Copilot purchased by my employer. In the project I only used the code suggestions (which work similarly to snippets); they help with repetitive code and faster typing. I did NOT use the generative prompt tool. In my previous work I have also found that prompting often causes a lot of extra work in the form of rewriting code, so I tend not to use it anyway.
3. On some occasions I asked ChatGPT for explanations of e.g. pd.DataFrame syntax, since it is sometimes a bit confusing.

# Project work

In [1487]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sc
plt.style.use('ggplot') # found from online tutorial
# import seaborn as sns



In [1488]:
# Loading data
df = pd.read_csv('./habits.data',
                sep=";",
                na_values=["?"],
                index_col=False,
                header=0)

### Data types

#### Demographic variables
- kohde - household ID: nominal
- jasen - member ID (within household): nominal
- pvknro - day of week: categorical, nominal and binary
- sp - sex: categorical, nominal and binary
- IKAL1 - age group: categorical, ordinal
- ASALUE - living environment: categorical, nominal


#### Activity variables
Time spent on activities (measured in minutes): Quantitative, ratio
- V1 - working: ratio
- V5 - cooking: ratio
- V21 - childcare: ratio
- V22 - reading and playing with children: ratio

Place visited in past 12 months 
- Values: Categorical, indicator, binary
    - 1 = yes
    - 2 = no

- H1a_A - cinema: indicator
- H1b_A - theater: indicator


## 1. Data preparation

Initially, NaN values are dropped. This choice may be revisited during the study, depending on how it affects the results.

At first I thought of replacing zero values with means or medians, but that would heavily influence the results by pushing up correlations and other metrics. It seems that participants have deliberately reported 0 minutes, meaning they did not take part in these activities at all. This must be reflected in the results.
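
The reasoning about zero values can be checked on a small made-up sample. The sketch below uses invented minute values (not the survey data) and a hypothetical helper `impute_zeros_with_mean`. It compares the mean and standard deviation before and after replacing zeros with the mean of the non-zero values.

```python
import pandas as pd

# Invented minutes for one activity; 0 means "did not do the activity".
minutes = pd.Series([0, 0, 0, 0, 480, 510, 0, 450, 0, 0])


def impute_zeros_with_mean(values: pd.Series) -> pd.Series:
    """Hypothetical helper: replace zeros with the mean of the non-zero values."""
    nonzero_mean = values[values > 0].mean()
    return values.where(values > 0, nonzero_mean)


imputed = impute_zeros_with_mean(minutes)

# Side-by-side summary of both versions.
summary = pd.DataFrame(
    {
        'original': [minutes.mean(), minutes.std()],
        'zeros_imputed': [imputed.mean(), imputed.std()],
    },
    index=['mean', 'std'],
).round(2)
print(summary)
```

In this toy sample the imputed mean is far above the original one and the spread shrinks, which is the kind of distortion I want to avoid.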

### Importing, renaming, description

In [1489]:
# Keep relevant columns
columns_to_keep = ['kohde', 
                   'jasen', 
                   'pvknro', 
                   'sp', 
                   'IKAL1', 
                   'ASALUE', 
                   'V1', 
                   'V21', 
                   'V22', 
                   'V5', 
                   'H1a_A', 
                   'H1b_A']
df = df[columns_to_keep].copy()
# df

In [1490]:
# Renaming columns
df.rename(columns={'kohde': 'household_id', 
                   'jasen': 'member_id', 
                   'pvknro': 'day_of_week', 
                   'sp': 'sex', 
                   'IKAL1': 'age_group', 
                   'ASALUE': 'area', 
                   'V1': 't_working', 
                   'V21': 't_childcare',          # V21 = childcare (see Data types)
                   'V22': 't_activity_w_child',   # V22 = reading and playing with children
                   'V5': 't_cooking',             # V5 = cooking
                   'H1a_A': 'visited_cinema', 
                   'H1b_A': 'visited_theatre'
                   }, inplace=True)

In [None]:
df.info()  # prints its summary directly and returns None
print("\n\n")
print(df.describe().round(2), "\n\n")
print(df.isna().sum(),"\n\n")
print(df.nunique(), "\n\n")

#### Notes from initial data check

- 338 unique household IDs
- All other categorical variables contain only the categories listed in the data description
- The non-categorical variables contain many NaN values.
- Ignoring NaNs, the number of unique values per column looks OK.
- At least the lowest 25% of rows have 0 minutes in all activities
- Dtypes are not correct


### Cleaning erroneous values

In [None]:
# df[['visited_cinema', 'visited_theatre']].value_counts() --> Shows bad values

# 'visited_cinema' and 'visited_theatre' have NaN values and other erroneous values
# Map 1 -> 'Yes' and 2 -> 'No'; everything else becomes NaN, then rows containing NaN are deleted

df['visited_cinema'] = df['visited_cinema'].map({1.0: 'Yes', 2.0: 'No'})
df['visited_theatre'] = df['visited_theatre'].map({1.0: 'Yes', 2.0: 'No'})
df.dropna(subset=['visited_cinema', 'visited_theatre'], inplace=True)

print(df.isna().sum())


In [1493]:
columns_to_clean = ['t_working','t_cooking', 't_childcare', 't_activity_w_child']
for column in columns_to_clean:
    # Convert to numeric; invalid entries become NaN
    df[column] = pd.to_numeric(df[column], errors='coerce')
    # Replace NaN with 0 before converting to integers
    df[column] = df[column].fillna(0).astype(int)

After handling the bad-quality data, there are 23 households that have answered only for the weekend or only for a weekday. They are dropped so that every household has data for both a weekday and the weekend.

In [None]:
# Counting how many answers each household has
occurrences = df['household_id'].value_counts()
one = (occurrences == 1).sum()
two = (occurrences == 2).sum()
over_two = (occurrences > 2).sum()

print("One: {}, Two: {}, Over Two: {}".format(one, two, over_two))

household_ids = occurrences[occurrences == 1].index

# Dropping households with only 1 answer
df = df[~df['household_id'].isin(household_ids)].copy()
print("Lines deleted")

### Retyping data

In [None]:
# Retyping columns
df['household_id'] = df['household_id'].astype('int64')                 # Nominal identifier stored as integer
df['member_id'] = pd.Categorical(df['member_id'])                       # Categorical, nominal
df['day_of_week'] = pd.Categorical(df['day_of_week'])                   # Categorical, binary
df['sex'] = pd.Categorical(df['sex'])                                   # Categorical, binary
df['age_group'] = pd.Categorical(df['age_group'], ordered=True)         # Categorical, ordinal 
df['area'] = pd.Categorical(df['area'])                                 # Categorical, nominal

# Quantitative, ratio: integer values measuring minutes
df['t_working'] = df['t_working'].astype('int64')
df['t_cooking'] = df['t_cooking'].astype('int64')
df['t_childcare'] = df['t_childcare'].astype('int64')
df['t_activity_w_child'] = df['t_activity_w_child'].astype('int64')

# Categorical, binary, indicator
df['visited_cinema'] = pd.Categorical(df['visited_cinema'])
df['visited_theatre'] = pd.Categorical(df['visited_theatre'])

df.dtypes


# Deleting NaN values or replacing with 0 -> if replaced, may introduce bias towards not going. 


#### Values to human readable format

In [None]:
df['day_of_week'] = df['day_of_week'].cat.rename_categories({1: 'weekday', 2: 'weekend'})
df['sex'] = df['sex'].cat.rename_categories({1: 'male', 2: 'female'})
df['area'] = df['area'].cat.rename_categories({1: 'city', 2: 'municipality', 3: 'rural'})
df['age_group'] = df['age_group'].cat.rename_categories({
    1: "10-14",
    2: "15-19",
    3: "20-24",
    4: "25-34",
    5: "35-44",
    6: "45-54",
    7: "55-64",
    8: "65-74",
    9: "75+"
})
# rename_categories keeps the categorical dtype (and the order of age_group)
# and avoids the deprecation warnings that replace() gave on these columns

In [None]:
# Checking ID values for anomalies
print('ID max:', df.household_id.max())
print('ID min:', df.household_id.min())
print("Unique ID's:", df.household_id.nunique())

# Looks OK
# A few household_ids are missing since we deleted some rows and did not reindex

# Time spent in activities
# Check weekend and weekday separately


## 2. Explorative analysis

- Visuals -> hist, plot, heatmap maybe?
- Descriptive statistics -> mean, median, std deviation, freq counts
- Figuring out which measures are important
- Recognizing patterns and special groups etc.
- Few hypotheses -> relations, group behaviour etc.

In [None]:
# Subsets by day type and sex: day_of_week is 'weekday' or 'weekend', sex is 'male' or 'female'
female_workday = df[(df['day_of_week'] == 'weekday') & (df['sex'] == 'female')].sort_values('age_group')
female_weekend = df[(df['day_of_week'] == 'weekend') & (df['sex'] == 'female')].sort_values('age_group')
male_workday = df[(df['day_of_week'] == 'weekday') & (df['sex'] == 'male')].sort_values('age_group')
male_weekend = df[(df['day_of_week'] == 'weekend') & (df['sex'] == 'male')].sort_values('age_group')

male_weekend['t_working'].plot.hist(bins=20, xlabel='Minutes')
male_workday['t_working'].plot.hist(bins=20, xlabel='Minutes')
female_weekend['t_working'].plot.hist(bins=20, xlabel='Minutes')
female_workday['t_working'].plot.hist(bins=20, xlabel='Minutes')

### Questions that arise

- Group differences
    - Working
    - Not working
    - Different age groups
    - Men and Women

- Member ID seems quite irrelevant, unless behaviour differs significantly between members

### 2.1 Visuals and plotting

In [None]:
df.groupby('age_group', observed=False)[['visited_cinema', 'visited_theatre']].value_counts().unstack('age_group').plot.area(stacked=True)

# Older people visit fewer cultural events, especially the cinema; they tend to go to the theatre only, or to both, but rarely to the cinema only.

In [None]:
df.groupby('day_of_week', observed=False)[['visited_cinema', 'visited_theatre']].value_counts().unstack('day_of_week').plot.bar(stacked=True)


In [None]:
# Checking rows where no time was spent on any activity
time_columns = ['t_working', 't_cooking', 't_childcare', 't_activity_w_child']
time_df = df[time_columns]
ids = df.loc[time_df.sum(axis=1) == 0].household_id.value_counts()
print("Number of IDs with 0 minutes in all 4 fields:", len(ids))

ids = ids[ids > 1].index
print("Number of IDs where both weekend and weekday activities are 0 minutes:", len(ids))
ids

df_no_activity = df[(df['household_id'].isin(ids)) & (df['visited_cinema'] == 'No') & (df['visited_theatre'] == 'No') ]
len(df_no_activity) / 2 # Number of individuals that do none of the activities (two rows each)

In [None]:
# Trying to create subplots with a loop
#  Age distributions in relation to sex, area are interesting 
demographics = ['member_id', 'area', 'sex', 'age_group']

columns = len(demographics)
rows = int(columns / 2)
fig, axes = plt.subplots(nrows=rows, ncols=2, figsize=(10, 5 * rows))
axes = axes.flatten()

# axes.flatten() above turns the 2D grid of axes into a 1D array that can be indexed with a single integer
# One subplot per demographic column
for i, column in enumerate(demographics):
    df.groupby([column], observed=False)['household_id'].nunique().plot.bar(ax=axes[i], fontsize=14)
    axes[i].set_title(f'Unique Household IDs by {column}')
    axes[i].set_ylabel('Count of Unique Household IDs')
    axes[i].set_xlabel(column)

plt.tight_layout()
plt.show()

# Density plots ???

# df[df['sex'] == 'female'].groupby(['age_group', 'area']).size().unstack().plot(xlabel='female', kind='bar', ax=axes[1], stacked=True)


In [None]:
# Creating subplots individually
# Calculated with respect to unique households (~336)

# Number of unique households for men and women by age group and area
df_male = df[df['sex'] == 'male'][['household_id','age_group','area']].groupby(['age_group','area'], observed=True).nunique()
df_female = df[df['sex'] == 'female'][['household_id','age_group','area']].groupby(['age_group', 'area'], observed=True).nunique()

fig, axes = plt.subplots(nrows=2, 
                         ncols=2, 
                         figsize=(10, 10))

df_male.unstack('area').plot(kind='bar', ax=axes[0,0])
df_male.unstack('area').plot(kind='density', ax=axes[1,0])
df_female.unstack('area').plot(kind='bar', ax=axes[0,1])
df_female.unstack('area').plot(kind='density', ax=axes[1,1])




`observed=True` shows:
- that the younger age groups are missing from the male population in the municipality and rural areas
- that the female population in the municipality area also lacks respondents
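
To make the effect of the `observed` argument concrete, here is a minimal sketch on an invented table (the rows and the name `toy` are made up for illustration). With categorical group keys, `observed=False` keeps every combination of categories and reports empty ones as 0, while `observed=True` lists only the combinations that occur.

```python
import pandas as pd

# Invented respondents; nobody in the youngest age group lives in 'rural'.
toy = pd.DataFrame({
    'age_group': pd.Categorical(['10-14', '25-34', '25-34', '10-14'],
                                categories=['10-14', '25-34'], ordered=True),
    'area': pd.Categorical(['city', 'city', 'rural', 'city'],
                           categories=['city', 'rural']),
})

# observed=False: every category combination, empty ones counted as 0.
all_combinations = toy.groupby(['age_group', 'area'], observed=False).size()

# observed=True: only combinations that are present in the data.
seen_combinations = toy.groupby(['age_group', 'area'], observed=True).size()

print(all_combinations)
print(seen_combinations)
```

Missing rows in the `observed=True` result are therefore a quick way to spot groups without respondents.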

In [None]:
fig, axes = plt.subplots(nrows=1, 
                         ncols=2, 
                         sharey=True, 
                         sharex=True, 
                         figsize=(12, 6),
                         )

# For area duplicates are not needed
unique_df = df.drop_duplicates(subset='household_id', keep='first')
unique_df

unique_df[unique_df['sex'] == 'male'].groupby(['age_group', 'area'], observed=True).size().unstack('area').plot(kind='line', ax=axes[0], fontsize=14, linewidth=2)
unique_df[unique_df['sex'] == 'female'].groupby(['age_group', 'area'], observed=True).size().unstack('area').plot(kind='line', ax=axes[1], fontsize=14, linewidth=2)

fig.suptitle('Line plots for area of living')
axes[0].set(xlabel='Male')
axes[1].set(xlabel='Female')

In [None]:
sex_sums = df['sex'].value_counts()
ratio = sex_sums['female'] / sex_sums['male']
ratio
# Overall there are {ratio} times as many women as men (counted over rows)

### 2.2 Numerical variables measuring time
Mean, median, standard deviation and frequencies



#### Means and deviations without grouping

In [None]:
time_columns = ['t_working', 't_cooking', 't_childcare', 't_activity_w_child']
weekday_mean = df[df['day_of_week'] == 'weekday'][time_columns].mean()
weekend_mean = df[df['day_of_week'] == 'weekend'][time_columns].mean()

weekday_std = df[df['day_of_week'] == 'weekday'][time_columns].std()
weekend_std = df[df['day_of_week'] == 'weekend'][time_columns].std()

pd.concat([weekday_mean, weekend_mean, weekday_std, weekend_std], axis=1, keys=['Weekday Mean', 'Weekend Mean', 'Weekday Std', 'Weekend Std']).round(2)



#### Grouped by sex

In [None]:
group_columns = ['sex','t_working', 't_cooking', 't_childcare', 't_activity_w_child']
weekday_mean = df[df['day_of_week'] == 'weekday'][group_columns].groupby('sex', observed=False).mean()
pd.concat([weekday_mean], axis=1, keys=['Weekday Mean']).round(2)

In [None]:
group_columns = ['sex','t_working', 't_cooking', 't_childcare', 't_activity_w_child']
weekend_mean = df[df['day_of_week'] == 'weekend'][group_columns].groupby('sex', observed=False).mean()
pd.concat([weekend_mean], axis=1, keys=['Weekend Mean']).round(2)

#### Grouped by age_group

In [None]:
group_columns = ['age_group','t_working', 't_cooking', 't_childcare', 't_activity_w_child']
weekday_mean = df[df['day_of_week'] == 'weekday'][group_columns].groupby('age_group', observed=False).mean()
pd.concat([weekday_mean], axis=1, keys=['Weekday Mean']).round(2)

In [None]:
weekend_mean = df[df['day_of_week'] == 'weekend'][group_columns].groupby('age_group', observed=False).mean()
pd.concat([weekend_mean], axis=1, keys=['Weekend Mean']).round(2)

#### Grouped by area

In [None]:
group_columns = ['area','t_working', 't_cooking', 't_childcare', 't_activity_w_child']
weekday_mean = df[df['day_of_week'] == 'weekday'][group_columns].groupby('area', observed=False).mean()
pd.concat([weekday_mean], axis=1, keys=['Weekday Mean']).round(2)

In [None]:
weekend_mean = df[df['day_of_week'] == 'weekend'][group_columns].groupby('area', observed=False).mean()
pd.concat([weekend_mean], axis=1, keys=['Weekend Mean']).round(2)

- t_working: the standard deviation indicates large variation in working time, given that the mean is less than 2 hours

- The same goes throughout the data. Time spent on these activities is quite low on average with a relatively high deviation, which indicates that most of those who spend time on these activities spend significantly more than the average, while many people spend no time at all or very little

- There are groups that are employed and unemployed, and groups that have children or grandchildren or live at home and therefore spend time on childcare and act_w_child.

- Means and deviations by age group, sex and living area should be calculated to observe more.
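
The "low mean, high deviation" pattern can be made explicit by looking at the share of zeros and at the mean among those who did the activity at all. This is a minimal sketch on invented minutes (the table `toy` is not the survey data):

```python
import pandas as pd

# Invented minutes per respondent, not the survey data.
toy = pd.DataFrame({
    't_working': [0, 0, 480, 0, 510, 0],
    't_cooking': [30, 0, 45, 60, 0, 15],
})

zero_share = (toy == 0).mean()        # fraction of rows with 0 minutes
mean_all = toy.mean()                 # mean including the zeros
mean_active = toy[toy > 0].mean()     # mean among rows with more than 0 minutes

overview = pd.concat(
    [zero_share, mean_all, mean_active],
    axis=1,
    keys=['share_zero', 'mean_all', 'mean_active'],
).round(2)
print(overview)
```

A large `share_zero` together with a `mean_active` well above `mean_all` would support the reading that the averages are pulled down by people who do not do the activity at all.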

### 2.3 Categorical variables characteristics

In [None]:
# Living area distribution
counts = df['area'].value_counts().copy()

women_count = (df.drop_duplicates(subset='household_id')['sex'] == 'female').sum()
print(women_count)
area_sex_counts = df.groupby(['area', 'sex'], observed=False).size().unstack().copy()
axis3 = (area_sex_counts / 2).plot(kind='bar', title='Distribution of Men and Women in Living Area')
axis3.set_xticklabels(axis3.get_xticklabels(), rotation=0)

In [1514]:
# Filter the dataframe to include only weekday rows
df_workday = df[df['day_of_week'] == 'weekday'].copy()

# Encode the Yes/No indicators as 1/0 so they can be used numerically
df_scatter_workday = df_workday.copy()
for column in ['visited_cinema', 'visited_theatre']:
    df_scatter_workday[column] = (df_scatter_workday[column] == 'Yes').astype('int64')

# Select the time columns to normalize
columns_to_normalize = ['t_working', 't_cooking', 't_childcare', 't_activity_w_child']

# Apply the rank method to normalize to percentiles
df_scatter_workday[columns_to_normalize] = df_scatter_workday[columns_to_normalize].rank(pct=True)

# Plot the scatter matrix
# scatter_matrix = pd.plotting.scatter_matrix(df_scatter_workday[columns_to_normalize], figsize=(10, 10))


In [1515]:
# Plot distributions of categorical variables
categorical_columns = ['day_of_week', 'sex', 'age_group', 'area']  # Day of week, Sex, Age, Area

### 2.4 Recognizing groups by activity

#### Answering to questions
3. With respect to which activities do men and women differ?
4. With respect to which activities do living environments differ?

- Employed and unemployed
- People who go to theatre and cinema tend to live in cities
- Cook a lot -> lives somewhere?
- Employment percentage in cities and other areas

#### Employed and unemployed dataframes

In [None]:
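# Proxy for employment: any working minutes on the diary day
# .copy() so that columns can be reassigned later without SettingWithCopyWarning
employed = df[df['t_working'] > 0].copy()
unemployed = df[df['t_working'] == 0].copy()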
employed.head()


# working = numerical['t_working']
# cooking = numerical['t_cooking']
# childcare = numerical['t_childcare']
# w_child = numerical['t_activity_w_child']


In [None]:
employed['visited_cinema'] = employed['visited_cinema'].map({'Yes': 1, 'No': 0}).astype(bool)
employed['visited_theatre'] = employed['visited_theatre'].map({'Yes': 1, 'No': 0}).astype(bool)
unemployed['visited_cinema'] = unemployed['visited_cinema'].map({'Yes': 1, 'No': 0}).astype(bool)
unemployed['visited_theatre'] = unemployed['visited_theatre'].map({'Yes': 1, 'No': 0}).astype(bool)

#### 2.4.1 Visited theatre and cinema

##### Visited activities //  group 'age'

In [None]:
employed.groupby('age_group', observed=False)[['visited_theatre', 'visited_cinema']].mean()

In [None]:
unemployed.groupby('age_group', observed=False)[['visited_theatre', 'visited_cinema']].mean()

##### Visited activities // group 'sex'

In [None]:
employed.groupby('sex', observed=False)[['visited_theatre', 'visited_cinema']].mean()

In [None]:
unemployed.groupby('sex', observed=False)[['visited_theatre', 'visited_cinema']].mean()

#### 2.4.2 Time spent on activities 

##### Time spent // group 'sex'

In [None]:
# employed.groupby('age_group').value_counts()['t_working']
# Group by 'sex' in the employed DataFrame
employed.groupby('sex', observed=False)[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']].mean().round(3)


In [None]:
unemployed.groupby('sex', observed=False)[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']].mean().round(3)


##### Time spent // group 'age'

In [None]:
employed.groupby('age_group', observed=False)[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']].mean().round(3)

In [None]:
unemployed.groupby('age_group', observed=False)[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']].median().round(3)

##### Time spent // group 'area'

In [None]:
employed.groupby('area', observed=False)[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']].mean().round(3)

#### 2.4.3 People who are working all week

In [None]:
df_w = df[df['t_working'] > 0]
id_counts = df_w.groupby('household_id', observed=False).size()
id_two_entries = id_counts[id_counts == 2].index

df_w[df_w['household_id'].isin(id_two_entries)].copy()

In [None]:
df_wc = df[df['t_working'] > 0]
df_wc[df_wc['t_cooking'] > 0]

In [None]:
df_wc = df[df['t_working'] == 0]
df_wc[df_wc['t_cooking'] > 0]

In [None]:
# Calculate median for each activity, excluding zero values
# (median over households of each household's own median)
household_medians_nonzero = {
    column: df[df[column] > 0].groupby('household_id')[column].median().median()
    for column in ['t_working', 't_cooking', 't_childcare', 't_activity_w_child']
}

print("Median time spent (excluding zero values):\n", household_medians_nonzero)


#### By weekday and weekend

In [1531]:
df_weekday = df[df['day_of_week'] == 'weekday']
df_weekend = df[df['day_of_week'] == 'weekend']

#### People who are not working

In [None]:
df_wc = df_weekday[df_weekday['t_working'] == 0]
df_wc_ck = df_wc[df_wc['t_cooking'] == 0]
df_wc.describe().round(2)

## 3. Patterns and hypotheses

1. Characterise the individuals that are present in the data. Are there groups of similar persons?
    - Trivial groups -> age, area, sex
    - Groups via activity -> how to know??
    - Not so trivial groups -> employed, unemployed, cook a lot?, with children, visiting culture, not visiting culture

2. Estimate how much time on average households spend daily on each activity.
    - Mean values + medians -> Task: calculate medians

5. Which activities are associated with each other?
    - Task: calculate correlations between activities
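
For task 2, one possible way to turn the mean into an estimate with uncertainty is a t-based confidence interval over households. The sketch below is only an illustration under stated assumptions: the rows in `toy` are invented, `mean_with_ci` is a hypothetical helper, and each household's diary days are weighted equally (a true daily average might weight weekdays and weekends differently). With data as skewed as these minutes, the interval is approximate.

```python
import numpy as np
import pandas as pd
import scipy.stats as sc

# Invented diary rows: two days per household, minutes of cooking.
toy = pd.DataFrame({
    'household_id': [1, 1, 2, 2, 3, 3, 4, 4],
    't_cooking':    [30, 60, 0, 20, 45, 75, 10, 0],
})


def mean_with_ci(values: pd.Series, confidence: float = 0.95):
    """Hypothetical helper: sample mean with a t-based confidence interval."""
    n = len(values)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    low, high = sc.t.interval(confidence, n - 1, loc=mean, scale=sem)
    return mean, low, high


# One value per household: the average over its diary days.
per_household = toy.groupby('household_id')['t_cooking'].mean()
mean, low, high = mean_with_ci(per_household)
print(f"mean={mean:.1f} min, 95% CI=({low:.1f}, {high:.1f})")
```

Aggregating to one value per household first keeps the two diary days of the same household from being counted as independent observations.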

## 4. Statistical Analysis

1. Kruskal-Wallis test for the numerical variables (t_ activities)

2. Pearson's chi-squared test for multiple categorical values -> value_counts as occurrences

1. Difference in groups
    1. Unemployed - employed
    1. Working only 
2. Find out if activities attract certain people
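
The Kruskal-Wallis test from the plan above could look roughly like this. The sketch runs on invented cooking minutes (the table `toy` is made up) and compares the three living areas with `scipy.stats.kruskal`. The test assumes independent observations, so with the real data it would make sense to run it on weekday and weekend rows separately.

```python
import pandas as pd
import scipy.stats as sc

# Invented weekday cooking minutes by living area, not the survey data.
toy = pd.DataFrame({
    'area': ['city'] * 5 + ['municipality'] * 5 + ['rural'] * 5,
    't_cooking': [10, 20, 15, 0, 25, 30, 35, 20, 40, 45, 60, 55, 70, 50, 65],
})

# One array of observations per group.
samples = [group['t_cooking'].to_numpy() for _, group in toy.groupby('area')]

# H0: all groups come from the same distribution.
statistic, p_value = sc.kruskal(*samples)
print(f"H = {statistic:.2f}, p = {p_value:.4f}")
```

A small p-value only says that at least one group differs; which groups differ would need pairwise follow-up comparisons.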

In [None]:
df_weekday_time = df_weekday[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']]
df_weekend_time = df_weekend[['t_working', 't_cooking', 't_childcare', 't_activity_w_child']]
df_weekday_time.describe().round(2)

In [None]:

# pd.plotting.scatter_matrix(df_weekday_time, figsize=(20,20))
df_weekday_time.corr(method='spearman')

### Statistical Tests
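
For the categorical variables, Pearson's chi-squared test listed in the plan can be run on a contingency table of counts. This is a minimal sketch with invented answers (the table `toy` is made up), using `pd.crosstab` and `scipy.stats.chi2_contingency`. Note that for a 2x2 table SciPy applies Yates' continuity correction by default.

```python
import pandas as pd
import scipy.stats as sc

# Invented answers, not the survey data.
toy = pd.DataFrame({
    'sex': ['male'] * 20 + ['female'] * 20,
    'visited_theatre': ['Yes'] * 5 + ['No'] * 15 + ['Yes'] * 14 + ['No'] * 6,
})

# Counts of each sex / answer combination.
table = pd.crosstab(toy['sex'], toy['visited_theatre'])

# H0: visiting the theatre is independent of sex.
chi2, p_value, dof, expected = sc.chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

With the real data, each person should appear only once in the table (for example by dropping duplicate rows per household member first), since the visit indicators do not change between diary days.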

### Correlation

Overall, it could be assumed that age, sex and region will group participants similarly.

Correlations are still generally quite low, with extreme values at around 0.43 and -0.27.
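
To judge whether a correlation of this size is distinguishable from zero, the Spearman coefficient can be computed together with a p-value. The sketch below uses invented minutes (the table `toy` is made up) and `scipy.stats.spearmanr`; it also checks that the coefficient matches what `DataFrame.corr(method='spearman')` reports.

```python
import pandas as pd
import scipy.stats as sc

# Invented minutes, not the survey data.
toy = pd.DataFrame({
    't_childcare':        [0, 10, 0, 45, 60, 0, 30, 90],
    't_activity_w_child': [0, 5, 0, 20, 40, 10, 15, 30],
})

# Rank correlation with a p-value for H0: no monotonic association.
rho, p_value = sc.spearmanr(toy['t_childcare'], toy['t_activity_w_child'])

# Same coefficient from the pandas correlation matrix, without a p-value.
rho_pandas = toy.corr(method='spearman').iloc[0, 1]

print(f"rho = {rho:.2f}, p = {p_value:.4f}, pandas rho = {rho_pandas:.2f}")
```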