In [30]:
import pandas as pd
import numpy as np
pd.options.plotting.backend = 'plotly'

In [7]:
%%bash
kaggle datasets download -d maajdl/yeh-concret-data -p ./data/
unzip ./data/yeh-concret-data.zip -d ./data/
rm ./data/yeh-concret-data.zip

Downloading yeh-concret-data.zip to ./data



100%|██████████| 10.2k/10.2k [00:00<00:00, 14.1MB/s]


Archive:  ./data/yeh-concret-data.zip
  inflating: ./data/Concrete_Data_Yeh.csv  


In [3]:
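def load_data(path: str = './data/Concrete_Data_Yeh.csv') -> pd.DataFrame:
    '''
    Loads the concrete dataset from a CSV file into a DataFrame.
    '''
    try:
        return pd.read_csv(path)
    except Exception as e:
        # Report the problem, then re-raise so later cells never run on None.
        print(f'error while trying to load in data: {e}')
        raise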
    
df = load_data()

# EDA
**E**xploratory **D**ata **A**nalysis is a core task in a data scientist's or analyst's workflow. It is concerned with identifying, exploring, and analyzing data to gain insights and identify patterns, relationships, and anomalies.

**EDA** involves several steps:
* Data sourcing
    - Data can come from different sources such as databases, excel sheets, text files or scraping the web.
* Data cleaning
    - Some rows can have imperfections. How close the data is to perfect determines its quality, which can be measured with the following metrics, _also known as data quality dimensions_:
        * Accuracy
        * Completeness
        * Consistency
        * Timeliness
        * Validity
        * And many more dimensions
* Data visualization
    - A powerful tool for exploring patterns and relationships in the data visually. Graphs and charts can reveal trends, outliers, and correlations that may not appear in summary statistics alone.
* Summary statistics
    - Statistics such as the mean, median, standard deviation, and range provide a view of the data distribution. They can also help identify unusual values or patterns in the data.
* Hypothesis testing
    - Hypothesis testing is used to determine whether a pattern or relationship in the data is statistically significant.
* Iteration
    - EDA is an iterative process that involves refining hypotheses, exploring new variables and testing different visualizations and statistical methods to gain insights about the data.

[Adapted from this great medium article.](https://ankushmulkar.medium.com/complete-exploratory-data-analysis-step-by-step-guide-for-data-analyst-34a07156217a)
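As a rough illustration of the hypothesis testing step, the sketch below runs a permutation test on a Pearson correlation. It is only one of many possible tests, the helper name `permutation_corr_test` is hypothetical, and the data is synthetic rather than taken from the concrete dataset.

```python
import numpy as np

def permutation_corr_test(x, y, n_permutations=2000, seed=0):
    """Return (observed Pearson r, two-sided permutation p-value)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Shuffling y breaks any real link with x, which simulates the null hypothesis.
    null = np.array([
        np.corrcoef(x, rng.permutation(y))[0, 1]
        for _ in range(n_permutations)
    ])
    # Share of shuffled correlations at least as extreme as the observed one.
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic stand-in data (an assumption for illustration): y depends on x plus noise.
demo_rng = np.random.default_rng(42)
x_demo = demo_rng.normal(size=200)
y_demo = 0.5 * x_demo + demo_rng.normal(size=200)

r_demo, p_demo = permutation_corr_test(x_demo, y_demo)
print(f'r = {r_demo:.2f}, p = {p_demo:.4f}')
```

A small p-value suggests the correlation is unlikely to be produced by chance alone. The same helper could be applied to two columns of `df`, for example `df['cement']` and `df['csMPa']`.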

Summary:
- EDA is the process of identifying patterns.
- EDA involves:
    1. Collecting
    2. Cleaning
    3. Visualizing
    4. Summarizing
    5. Testing
    6. Iterating
- Collecting is about gathering data from multiple sources, loading it, and documenting it.
- Cleaning makes sure the data is of high quality by handling missing values, outliers, duplicates, and inconsistencies.
- Visualizing data makes pattern discovery easier because patterns can be seen directly.



In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   cement            1030 non-null   float64
 1   slag              1030 non-null   float64
 2   flyash            1030 non-null   float64
 3   water             1030 non-null   float64
 4   superplasticizer  1030 non-null   float64
 5   coarseaggregate   1030 non-null   float64
 6   fineaggregate     1030 non-null   float64
 7   age               1030 non-null   int64  
 8   csMPa             1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB


In [6]:
df.describe()

| | cement | slag | flyash | water | superplasticizer | coarseaggregate | fineaggregate | age | csMPa |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


## Data Cleaning
This step is concerned with refining the quality of the data in order to derive quality insights.

Things to look out for when cleaning data are:
- Missing data
- Duplicated data
- Outliers
- Inconsistent data


In [18]:
# handling missing
def check_missing(df: pd.DataFrame):
    '''
    Returns the percentage of missing data.
    '''
    return df.isna() \
            .sum() \
            .div(df.shape[0]) \
            .mul(100).sort_values()
df.pipe(check_missing)

cement              0.0
slag                0.0
flyash              0.0
water               0.0
superplasticizer    0.0
coarseaggregate     0.0
fineaggregate       0.0
age                 0.0
csMPa               0.0
dtype: float64

No missing data is present

In [38]:
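n_rows = df.shape[0]
dup_fraction = np.round(np.divide(df.duplicated().sum(), n_rows), 2)
print(f'fraction of duplicated data:\t{dup_fraction}')
# Estimate derived from the rounded fraction; df.duplicated().sum() gives the exact count.
print(f'approx. number of duplicated rows:\t{n_rows * dup_fraction}')

fraction of duplicated data:	0.02
approx. number of duplicated rows:	20.6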
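The count above is an estimate based on a rounded fraction. A minimal sketch of an exact duplicate report follows, using a small made-up frame instead of the real dataset. The helper name `duplicate_report` and the toy values are assumptions for illustration.

```python
import pandas as pd

def duplicate_report(frame: pd.DataFrame) -> dict:
    """Count fully duplicated rows and their share of the frame."""
    # duplicated() marks every repeat after the first occurrence as True.
    dup_mask = frame.duplicated()
    return {
        'n_duplicates': int(dup_mask.sum()),
        'pct_duplicates': round(float(dup_mask.mean() * 100), 2),
    }

# Toy frame with made-up values; rows 0 and 1 are identical.
toy_dups = pd.DataFrame({
    'cement': [540.0, 540.0, 332.5, 198.6],
    'age': [28, 28, 270, 360],
})

report = duplicate_report(toy_dups)
# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy_dups.drop_duplicates().reset_index(drop=True)
print(report, deduped.shape)
```

On the real data, `df.pipe(duplicate_report)` would give the exact count. Whether to drop the duplicates is a modeling decision, since repeated rows may be genuine repeated measurements.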


In [31]:
df.plot(kind='box')

From the plot, there do not seem to be many outliers in the data. I will experiment with how they affect the performance of the regressor.
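To back up the visual impression with numbers, outliers can be counted per column using the common 1.5 × IQR whisker rule. This is a sketch under that assumption: the helper name `iqr_outlier_counts` is hypothetical, the toy data is made up, and the plotting backend may compute quartiles slightly differently.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Toy frame with made-up values: column 'a' has one extreme value, 'b' has none.
toy_box = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [10, 11, 12, 13, 14]})
counts = iqr_outlier_counts(toy_box)
print(counts)
```

Calling `df.pipe(iqr_outlier_counts)` would give one count per feature, which makes it easier to decide which columns deserve a closer look.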

Univariate analysis is the statistical analysis of a single variable. There are two types of variables:
1. Categorical
2. Quantitative

All columns in this dataset are numeric, so we skip straight to quantitative analysis.

**Clarifying things:**
- In statistics, there are two characteristics that can be described for a variable: central tendency and dispersion.
    * Central tendency refers to a central or typical value around which the data tends to cluster.
    * Dispersion or spread, refers to the extent to which data points in a dataset are spread out from the center.

Summary:
A dataset has two key statistical characteristics: its **central tendency**, a value that the data gathers around, and its **dispersion**, how spread out the data is.

One of the ways we can identify patterns in data is by using these two characteristics.
Statistics as a whole can be used to capture patterns: the deeper the understanding of statistics, the more complex and hidden the patterns that can be uncovered.
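The two characteristics can be computed side by side for each variable. The sketch below is one possible way to do it: the helper name `tendency_and_dispersion` and the toy values are assumptions for illustration.

```python
import pandas as pd

def tendency_and_dispersion(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize central tendency and dispersion for each numeric column."""
    numeric = frame.select_dtypes(include='number')
    return pd.DataFrame({
        # Central tendency: where the values cluster.
        'mean': numeric.mean(),
        'median': numeric.median(),
        # Dispersion: how far the values spread from the center.
        'std': numeric.std(),
        'iqr': numeric.quantile(0.75) - numeric.quantile(0.25),
        'range': numeric.max() - numeric.min(),
    })

# Toy frame with made-up values; one large value pulls the mean above the median.
toy_stats = pd.DataFrame({'age': [1, 7, 28, 56, 365]})
summary = tendency_and_dispersion(toy_stats)
print(summary)
```

A mean well above the median, as in this toy column, hints at a right-skewed distribution. The median and IQR are less sensitive to extreme values than the mean and standard deviation.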

In [None]:
# TODO:
# 1. Finish EDA in next session to begin modeling process.