In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('Housing/Housing.csv')
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
df.isnull().sum()

In [None]:
df_encoded = df.copy()

In [None]:
cat_cols = [col for col in df_encoded.select_dtypes('object').columns]
cat_cols

In [None]:
furnishingstatus_mapping = {
    'unfurnished': 0,
    'semi-furnished': 1,
    'furnished': 2
}

df_encoded['furnishingstatus'] = df['furnishingstatus'].map(furnishingstatus_mapping)

In [None]:
cat_cols = [col for col in df_encoded.select_dtypes('object').columns]
cat_cols

In [None]:
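from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in sorted label order,
# so a yes/no column becomes 'no' -> 0 and 'yes' -> 1.
for col in cat_cols:
    df_encoded[col] = LabelEncoder().fit_transform(df_encoded[col])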

df_encoded.head()

In [None]:
df.head()

In [None]:
df['furnishingstatus'].value_counts()

In [None]:
df_encoded.describe().T

## 2. Identify the variables of interest: the endogenous variable Y and the potentially explanatory variables X, and hypothesize the direction of the statistical relationships


In my dataset, I would consider '**price**' as the endogenous variable Y, because it's typically the target variable in housing datasets. This is the variable I'm trying to predict or explain.

The potential explanatory variables (exogenous variables X) could be all the other variables in my dataset: '**area**', '**bedrooms**', '**bathrooms**', '**stories**', '**mainroad**', '**guestroom**', '**basement**', '**hotwaterheating**', '**airconditioning**', '**parking**', '**prefarea**', '**furnishingstatus**'. These are the variables that could have an impact on the price of the house.

Here are some hypotheses on the direction of the statistical relationships:

- '**area**': I expect a positive relationship. Larger houses (in terms of area) are usually more expensive.
- '**bedrooms**', '**bathrooms**', '**stories**': I expect a positive relationship. Houses with more bedrooms, bathrooms, or stories are usually more expensive.
- '**mainroad**': I expect a positive relationship. Houses located on a main road might be more expensive due to better accessibility.
- '**guestroom**', '**basement**', '**hotwaterheating**', '**airconditioning**': I expect a positive relationship. Houses with these amenities are usually more expensive.
- '**parking**': I expect a positive relationship. Houses with more parking spaces are usually more expensive.
- '**prefarea**': I expect a positive relationship. Houses in a preferred area are usually more expensive.
- '**furnishingstatus**': The direction is less obvious, but with the ordinal encoding used above (unfurnished = 0, semi-furnished = 1, furnished = 2) I expect a positive relationship: fully furnished houses are likely to be more expensive than semi-furnished or unfurnished ones.

These are just hypotheses and the actual relationships need to be determined through data analysis.

## 3. Distribution of each of these variables (with plots where needed) and main univariate statistics

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# For each column in the DataFrame
for col in df_encoded.columns:
    # Plot a histogram
    plt.figure(figsize=(10, 6))
    sns.histplot(df_encoded[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

    # Print main statistical indicators
    print(df_encoded[col].describe())

## 4. Are some of the X variables correlated with each other?

In [None]:
corr_matrix = df_encoded.corr()

In [None]:
import seaborn as sns

# Create a heatmap from the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

In [None]:
corr_matrix['price'].sort_values(ascending=False)

## 5. Analyze the statistical link between X and Y: does it exist a priori? If so, is it significant?

The sorted table of correlations with the price above shows that every correlation coefficient is positive, so each variable tends to increase with the price. A correlation coefficient on its own does not say whether the link is statistically significant; that requires a test on the coefficient.
<br>
However, the strength of these relationships varies:  
* '**area**', '**bathrooms**', '**airconditioning**', '**stories**', and '**parking**' have the highest correlation coefficients with '**price**', all being above 0.38. This suggests that these variables have a relatively strong positive relationship with 'price'. In other words, houses with larger areas, more bathrooms, air conditioning, more stories, and more parking spaces tend to be more expensive.
<br>  
* '**bedrooms**', '**prefarea**', '**furnishingstatus**', '**mainroad**', '**guestroom**', and '**basement**' have moderate correlation coefficients with '**price**', ranging from 0.19 to 0.37. This suggests that these variables have a moderate positive relationship with 'price'. For example, houses with more bedrooms, located in a preferred area, with a higher furnishing status, on a main road, with a guest room, and with a basement tend to be somewhat more expensive, but the effect is not as strong as the previous variables.
<br>
* '**hotwaterheating**' has the lowest correlation coefficient with '**price**', at 0.093. This suggests that this variable has a weak positive relationship with 'price'. Houses with hot water heating might be slightly more expensive, but the effect is weak. 
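
The coefficients above measure the strength of each link, but the section question also asks whether the link is significant. A minimal sketch of one way to check this is to attach a two-sided p-value to each Pearson correlation with `scipy.stats.pearsonr`. The helper name `correlation_significance` and the synthetic `demo` frame are assumptions made for illustration; in the notebook the function would be called on `df_encoded` with `'price'` as the target.

```python
import numpy as np
import pandas as pd
from scipy import stats


def correlation_significance(data: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson r and two-sided p-value of every column against `target`."""
    rows = []
    for col in data.columns.drop(target):
        r, p_value = stats.pearsonr(data[col], data[target])
        rows.append({"variable": col, "r": r, "p_value": p_value})
    table = pd.DataFrame(rows).set_index("variable")
    return table.sort_values("r", ascending=False)


# Synthetic stand-in for df_encoded (hypothetical values, NOT the housing data):
# 'area' drives the price, 'noise' is unrelated to it.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "area": rng.normal(5000, 1500, n),
    "noise": rng.normal(0, 1, n),
})
demo["price"] = 250 * demo["area"] + rng.normal(0, 200_000, n)

corr_table = correlation_significance(demo, "price")
print(corr_table)
```

A p-value below the chosen threshold (0.05 is the usual convention) suggests the correlation is unlikely to be zero in the population; it says nothing about causality or about the effect once other variables are controlled for.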

## 6. Model this link with regression analysis: estimate the parameters, test their significance, and interpret the results

In [None]:
import statsmodels.api as sm

# Define X and Y
X = df_encoded.drop('price', axis=1)
Y = df_encoded['price']

# Add a constant to X
X = sm.add_constant(X)

# Create a model
model = sm.OLS(Y, X)

# Fit the model
results = model.fit()

# Print the summary
results.summary()

* **R-squared**: This is the coefficient of determination. It tells you the proportion of the variance in the dependent variable that is predictable from the independent variables. In this case, it's 0.680, which means that about 68% of the variability in 'price' can be explained by the independent variables in the model.  
* **Adj. R-squared**: This is the adjusted R-squared, which adjusts the statistic based on the number of independent variables in the model. It's slightly less than the R-squared, which is expected as it penalizes the addition of uninformative predictors in the model.  
* **coef**: These are the coefficients for each variable. They represent the change in the dependent variable (price) for a one-unit change in the corresponding independent variable, assuming all other variables are held constant. For example, the coefficient for 'area' is 243.9069, which suggests that for each additional unit of area, we can expect the price to increase by approximately 243.9069 units, assuming all other variables are held constant.  
* **std err**: This is the standard error of the estimate of the coefficient. Smaller values are better as they indicate that the estimate of the coefficient is more precise.  
* **t**: This is the t-statistic. It's the coefficient divided by the standard error.  
* **P>|t|**: This is the p-value. A p-value less than 0.05 is typically considered to indicate a statistically significant coefficient. For example, the p-value for 'area' is 0.000, which suggests that area is a statistically significant predictor of price.  
* **\[0.025 0.975\]**: These are the 95% confidence intervals for the coefficients. If the interval does not contain zero, it suggests that the variable is a significant predictor of the dependent variable.  
* **Omnibus/Prob(Omnibus)**: These tests are for the skewness and kurtosis of the residual (the difference between the observed and predicted values). A Prob(Omnibus) close to zero indicates that the residuals are not normally distributed.  
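* **Durbin-Watson**: This tests for first-order autocorrelation in the residuals, not for homoscedasticity. The statistic ranges from 0 to 4: a value close to 2 indicates no autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.  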
* **Jarque-Bera (JB)/Prob(JB)**: This is another test of the skewness and kurtosis of the residuals. A Prob(JB) close to zero indicates that the residuals are not normally distributed.  
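* **Cond. No.**: This is the condition number of the design matrix, a diagnostic for multicollinearity rather than a formal test. A large value (a common rule of thumb is above 30) indicates potential problems with multicollinearity or numerical instability.  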

In our model, most of the variables seem to be significant predictors of price (p-value < 0.05), except for '**bedrooms**'. The large Cond. No. points to possible multicollinearity, although it can also simply reflect predictors measured on very different scales (area versus 0/1 indicators). The residuals are not normally distributed (Prob(Omnibus) and Prob(JB) are close to zero).

## 7. At this stage, does the analysis appear to suffer from statistical biases? (omitted-variable bias, reverse causality, outliers distorting the link between X and Y?)

* **Omitted Variable Bias**: This occurs when a variable that influences the dependent variable is not included in the model. If such a variable also correlates with variables included in the model, it can lead to biased and inconsistent estimates. ***In this case, there might be other factors influencing the price of a house (like the age of the house, proximity to amenities, etc.) that are not included in the dataset.***
* **Reverse Causality**: This refers to a situation where the dependent variable is causing or influencing the independent variable, rather than the other way around. In this context, it's unlikely that the price of a house would influence its characteristics (like area, number of bedrooms, etc.), so ***reverse causality is probably not a concern here***.
* **Outliers**: Outliers can have a large influence on the results of a regression analysis, especially if the sample size is small. If there are houses in the dataset with characteristics or prices that are significantly different from the others, they could be influencing the results. You can check for outliers by examining the residuals from the regression (the difference between the observed and predicted values). Large residuals could indicate the presence of outliers. 

## 8. Could the analysis be refined by re-estimating the link between X and Y for different groups of individuals?

Yes, the analysis could be refined by re-estimating the relationship between X and Y separately for different groups of houses. This is often referred to as "stratified analysis" or "grouped analysis". For example, if the relationship between the independent variables and the price is suspected to differ between houses with and without air conditioning, the dataset can be split into two groups on the '**airconditioning**' variable and a separate regression run for each group.