# Exploring your data

EDA stands for *Exploratory Data Analysis*. It is an important step in any project: it gives you an idea of the contents of your dataset so that you can decide which method to use to extract the relevant information. EDA has two parts: first you verify the content and formatting of your data, and second you visualize it to gain insight into the distributions of the variables and the relations between them.

## Dataset

Of course, the first task is to import or access the dataset. In this course we always import simple CSV files or a folder full of images. In most professional settings, however, the data lives in a database and you might need to do some work to get access to it.
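To give an idea of what reading from a database can look like, here is a minimal sketch using Python's built-in `sqlite3` module together with `pd.read_sql_query`. The in-memory database, the table name `measurements` and its rows are invented for illustration; a real project would connect to an existing database instead.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the example is self-contained.
# The table name and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sample_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 0.5), (2, 1.25), (3, None)],
)
conn.commit()

# Read the result of a SQL query directly into a DataFrame
db_table = pd.read_sql_query("SELECT * FROM measurements", conn)
conn.close()
print(db_table)
```

Once the query result is in a DataFrame, the checks described in the rest of this chapter apply in the same way as for a CSV file.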

To illustrate the first steps of EDA, we import a small made-up dataset that makes it easy to understand potential problems. It just contains some information about classical composers:

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

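# Opt in to pandas' future behavior (no silent downcasting); this avoids the FutureWarning that replace() would otherwise raise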
pd.set_option('future.no_silent_downcasting', True)

In [None]:
composers = pd.read_excel("../datasets/composers.xlsx", sheet_name=1)
composers

## Checking the data

The first thing we have to check is the type of data we have in the table. We can easily do this with the ```dtypes``` attribute:

In [None]:
composers.dtypes

We see that the death column is an *object* column, which usually means that it contains text or mixed values. However, it should really be a number! The birth column also has a missing value, but it is recognized as *Not a Number* (NaN) and Pandas can deal with it. In the death column, in contrast, the missing value is encoded as the text ```unknown```, which forces the whole column to the *object* type. How can we fix this? We can for example ```replace``` some values:

In [None]:
composers.death.replace('unknown', np.nan)

Of course, we actually need to assign this new column to our dataframe:

In [None]:
composers.death = composers.death.replace('unknown', np.nan)
composers

In [None]:
composers.dtypes

We can see that the column is still of type `object`. We can map both the birth and the death column to integers, which in this case is better than float because it will save some space and we are only interested in discrete years. However, we cannot just use the normal `int` Python datatype because it does not support NaNs. We could either go back to using a float, or use a Pandas datatype such as the [16-bit Nullable Integer](https://pandas.pydata.org/docs/reference/api/pandas.Int16Dtype.html):

In [None]:
composers.birth = composers.birth.astype(pd.Int16Dtype())
composers.death = composers.death.astype(pd.Int16Dtype())

composers

In [None]:
composers.dtypes
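As an alternative to replacing the placeholder by hand, `pd.to_numeric` with `errors="coerce"` can do the conversion in one step. The following is a minimal sketch on a made-up toy column (not the composers file):

```python
import pandas as pd

# Toy column with the same kind of problem: years mixed with a text placeholder.
# The values are made up for illustration.
toy = pd.DataFrame({"death": [1750, "unknown", 1827, None]})

# errors="coerce" turns every entry that cannot be parsed as a number into NaN
death_float = pd.to_numeric(toy["death"], errors="coerce")

# The nullable integer dtype stores the years as integers and the gaps as <NA>
toy["death"] = death_float.astype("Int16")
print(toy.dtypes)
print(toy)
```

Note that `errors="coerce"` converts *any* unparseable entry to a missing value, including typos, so it is worth looking at the affected rows before relying on it.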

After the conversion, the missing values are preserved as `<NA>`. Do we actually need them? In some cases we can simply leave them in place, because many Pandas operations skip them. For example, when we ask Pandas to compute the mean of the columns, it just ignores those values.

In [None]:
composers.mean(numeric_only=True)

If only a few values are missing and we have made sure that they are not "important" (e.g. they do not represent a very specific class of data), we can just discard the corresponding rows. Again, we can use Pandas for this:

In [None]:
composers = composers.dropna()
composers
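Before dropping rows, it can help to check how many values are missing and how many rows would be lost. Here is a minimal sketch on a hypothetical table (the names and years are placeholders, not the composers data):

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few gaps
people = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "birth": [1685, np.nan, 1770, 1810],
        "death": [1750, 1791, np.nan, 1849],
    }
)

# Number of missing entries per column
missing_per_column = people.isna().sum()
print(missing_per_column)

# Fraction of rows that a plain dropna() would remove
rows_lost = 1 - len(people.dropna()) / len(people)
print(f"dropna() would remove {rows_lost:.0%} of the rows")

# Only require selected columns to be present, which keeps more rows
people_with_birth = people.dropna(subset=["birth"])
print(people_with_birth)
```

The `subset` argument is useful when only some columns matter for the analysis at hand: rows with gaps elsewhere are then kept.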

## Visualize relations between data

To understand relationships between features as well as their distributions, we could plot histograms and scatter plots for all of them. Instead of doing this manually, we can use a very useful Seaborn function called ```pairplot```. To get interesting plots we now turn back to our wine dataset:

In [None]:
wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
wine.head()

We use ```sns.pairplot``` only on a few columns of the dataset so that we can better see the plots:

In [None]:
sns.set_style("darkgrid")
sns.pairplot(wine.iloc[:,[0, 1, 2, 3, -1]]);

As you can see, the pairplot shows a histogram of each feature on the diagonal and, at every other position, a scatter plot for one pair of features. This type of visualization helps us quickly identify patterns and potential issues in the dataset. For example:

- Fixed acidity and citric acid appear to be highly correlated. Since they provide similar information, we might consider removing one to simplify the dataset without losing much predictive power.
- The histogram of residual sugar is right-skewed, with most values concentrated at lower levels and a long tail extending to the right. This suggests the presence of outliers (likely sweet wines), which could distort our analysis if not handled properly.
- Residual sugar outliers influence multiple relationships in the dataset. For example, when examining how sugar relates to fixed acidity, those extreme values could disproportionately affect correlation estimates. This is something we should account for in modeling.
- Citric acid has a disproportionately large number of very low (or zero) values. This suggests that many wines in the dataset contain little or no citric acid. Such a sharp cutoff might indicate a data collection artifact or a categorical effect, which could introduce bias in our models.
- The quality feature is highly imbalanced. Most wines are rated 5 or 6, while there are relatively few with ratings of 4 or 7. This imbalance is important to consider when building a predictive model, as a naïve classifier that always predicts 5 or 6 would achieve high accuracy but fail to provide meaningful predictions.
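
Visual impressions like the ones above can be backed up with a few numbers. The sketch below uses synthetic stand-in data (not the wine dataset) with invented column names, built to mimic a correlated pair of features, a right-skewed feature and an imbalanced label:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the wine dataset); column names are hypothetical
rng = np.random.default_rng(0)
n = 500
a = rng.normal(8.0, 1.5, n)
demo = pd.DataFrame(
    {
        "feature_a": a,
        "feature_b": 0.1 * a + rng.normal(0.0, 0.05, n),  # closely tied to feature_a
        "skewed": rng.lognormal(0.8, 0.5, n),             # long right tail
        "label": rng.choice([4, 5, 6, 7], size=n, p=[0.05, 0.45, 0.4, 0.1]),
    }
)

# Pearson correlation between the two related features
corr_ab = demo["feature_a"].corr(demo["feature_b"])

# Positive skewness indicates a right-skewed distribution
skewness = demo["skewed"].skew()

# Share of each class; the largest share equals the accuracy of
# always predicting the most frequent class
class_share = demo["label"].value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(f"correlation: {corr_ab:.2f}")
print(f"skewness: {skewness:.2f}")
print(class_share.sort_index())
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

With the `wine` DataFrame loaded earlier, the same calls could be applied to `wine["fixed acidity"]`, `wine["citric acid"]`, `wine["residual sugar"]` and `wine["quality"]`.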

## Next steps

As a next step, we might want to correct for some of the above observations (e.g. remove outliers). We will see practical examples later on when we try to use ML methods.
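
As a rough illustration of outlier removal, here is a sketch of one common heuristic, the interquartile range (IQR) rule. It is not necessarily the approach used later in the course; the helper `iqr_mask`, the factor `k=1.5` and the values are assumptions chosen for illustration:

```python
import pandas as pd

def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)

# Made-up values with one obvious extreme entry
sugar = pd.Series([1.8, 2.0, 2.1, 2.2, 2.4, 2.5, 2.6, 15.0], name="sugar")

# Keep only the entries that fall inside the IQR-based bounds
keep = iqr_mask(sugar)
filtered = sugar[keep]
print(filtered)
```

Whether removing such points is appropriate depends on the question: extreme values can be measurement errors, but they can also be genuine and informative observations.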

## Exercise

1. Import the dataset `kc_house_data.csv` from the dataset directory. It is a dataset about the price of houses in King County, Washington (USA), with information such as the number of bedrooms, surface, etc.

2. Use the ```pairplot``` function to look at relations between variables. Use only the columns 1 to 8 to avoid having too many plots (ignore the first column, which is only an index).

3. What do you observe in the relation between the variables ```price``` vs ```sqft_living```? Do you think you can predict the price with the ```sqft_living``` variable?

4. The bedroom distribution (on the diagonal) is strange. Make a single histogram (```sns.histplot```) with just this variable. Does the plot look OK? If not, what can you try to adjust, and why?

5. Do we have the same number of houses for each number of bedrooms? If not, how could this be a problem in the context of Machine Learning?