# **Exploratory Data Analysis (EDA)**
- Summarize main characteristics of the data
- Gain better understanding of the data set
- Uncover relationships between variables
- Extract important variables <br><br>
**Question**
*What are the characteristics which have the most impact on the car price?*
<br>
## Learning Objectives
- Descriptive Statistics
- GroupBy
- ANOVA
- Correlation
- Advanced Correlation - Statistics

## **Descriptive Statistics**
- Explore data before building complicated models
- Calculate descriptive statistics for data
- Describe basic features of data
- Give short summaries of the sample and its measures
1. `df.describe()`
2. `value_counts()`
   - `drive_wheels_counts = df['drive-wheels'].value_counts()`
3. Box plots
   - Median (middle value)
   - Upper Quartile (75th percentile)
   - Lower Quartile (25th percentile)
   - Upper Extreme: 1.5 times the interquartile range (IQR) above the upper quartile
   - Lower Extreme: 1.5 times the IQR below the lower quartile
   - Outliers: individual data points beyond the upper or lower extreme
   - `sns.boxplot(x='drive-wheels', y='price', data=df)`
4. Scatter Plot
   - Each observation represented as a point
   - Scatter plots show the relationship between two variables:
       1. Predictor/independent variables on x-axis
       2. Target/dependent variables on y-axis
     - `y = df['price']`
     - `x = df['engine-size']`
     - `plt.scatter(x, y)`
     - `plt.title('Scatterplot of engine size vs price')`
     - `plt.xlabel('Engine Size')`
     - `plt.ylabel('Price')`
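
The numeric summaries listed above can be sketched on a small table. The data below is hypothetical and only stands in for the car data set; the column names follow the notes. The box-plot quantities are computed by hand to show where the extremes and outliers come from.

```python
import pandas as pd

# Hypothetical toy data standing in for the car data set (an assumption).
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd", "fwd", "rwd", "fwd"],
    "engine-size": [97, 108, 152, 164, 110, 92, 209, 122],
    "price": [7000, 8500, 16500, 18000, 9000, 6500, 36000, 45000],
})

# 1. Count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()

# 2. Number of cars in each drive-wheels category
drive_wheels_counts = df["drive-wheels"].value_counts()

# 3. The quantities a box plot of price is built from
q1 = df["price"].quantile(0.25)       # lower quartile
median = df["price"].median()         # middle value
q3 = df["price"].quantile(0.75)       # upper quartile
iqr = q3 - q1                         # interquartile range
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

# Rows whose price falls outside the extremes are drawn as single points
outliers = df[(df["price"] < lower_extreme) | (df["price"] > upper_extreme)]

print(summary)
print(drive_wheels_counts)
print(outliers)
```

In a drawn box plot the whiskers usually stop at the most extreme data point that still lies inside these limits, so they can be shorter than 1.5 times the IQR.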

## **GroupBy in Python**
**Question:** <br>
*Is there any relationship between the different types of 'drive system' and the 'price' of the vehicles?*
- Use the pandas `df.groupby()` method
  - Can be applied to categorical variables
  - Groups data into categories
  - Works on single or multiple variables
    - `df_test = df[['drive-wheels', 'body-style', 'price']]`
    - `df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()`
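
A minimal sketch of the grouping step, using a hypothetical six-row table in place of the real data set. It shows that each combination of the two categorical columns collapses into one row holding the mean price.

```python
import pandas as pd

# Hypothetical toy data (an assumption); column names follow the notes.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd", "fwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "sedan", "sedan"],
    "price": [8000.0, 7000.0, 20000.0, 24000.0, 9000.0, 10000.0],
})

# Keep only the columns of interest, then average price per group.
df_test = df[["drive-wheels", "body-style", "price"]]
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()

print(df_grp)
```

With `as_index=False` the group keys stay as ordinary columns, which makes the result easier to reshape afterwards.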

### Pivot tables
**One variable is displayed along the columns, and the other variable is displayed along the rows**
<br>
`df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')`
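
As a rough illustration, the pivot can be tried on a hand-built grouped table. The values are hypothetical. Combinations that never occur in the data show up as `NaN` in the result.

```python
import pandas as pd

# Hypothetical grouped result (an assumption): one row per combination.
df_grp = pd.DataFrame({
    "drive-wheels": ["4wd", "fwd", "fwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan"],
    "price": [9000.0, 7000.0, 9000.0, 22000.0],
})

# Rows become drive-wheels, columns become body-style.
# Because no values column is named, the columns are two-level: ('price', <body-style>).
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")

print(df_pivot)
```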

### Heatmap
**Plot the target variable against multiple variables**
<br>
`plt.pcolor(df_pivot, cmap='RdBu')` <br>
`plt.colorbar()` <br>
`plt.show()`
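
A small sketch of the heatmap with axis labels added, assuming a hypothetical pivot table with no missing cells. The `Agg` backend is chosen here only so the example runs without a display; in a notebook that line can be dropped and `plt.show()` used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for headless runs)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical pivot table of mean prices (an assumption).
df_pivot = pd.DataFrame(
    [[9000.0, 12000.0], [7000.0, 9000.0], [15000.0, 22000.0]],
    index=["4wd", "fwd", "rwd"],
    columns=["hatchback", "sedan"],
)

fig, ax = plt.subplots()
mesh = ax.pcolor(df_pivot.values, cmap="RdBu")

# Put the category names at the centre of each cell.
ax.set_xticks(np.arange(df_pivot.shape[1]) + 0.5)
ax.set_xticklabels(df_pivot.columns)
ax.set_yticks(np.arange(df_pivot.shape[0]) + 0.5)
ax.set_yticklabels(df_pivot.index)

fig.colorbar(mesh, ax=ax)  # colour scale for price
```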

### Correlation
**Measures to what extent different variables are interdependent**
- Lung cancer to smoking
- Rain to Umbrella
- Correlation between two features (engine-size and price)
  - `sns.regplot(x='engine-size', y='price', data=df)`
  - `plt.ylim(0,)`
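
By default `sns.regplot` draws the scatter points together with a fitted straight line. The sketch below reproduces only the line fit with `numpy.polyfit` on hypothetical numbers, to show what a positive slope means here: price tends to rise as engine size grows.

```python
import numpy as np

# Hypothetical engine sizes and prices (an assumption).
engine_size = np.array([92, 97, 108, 122, 152, 164, 209])
price = np.array([6500, 7000, 8500, 11000, 16500, 18000, 36000])

# Least-squares straight line: price ~ slope * engine_size + intercept
slope, intercept = np.polyfit(engine_size, price, deg=1)

# Price the fitted line predicts for each observed engine size
fitted = slope * engine_size + intercept

print("slope:", slope)
print("intercept:", intercept)
```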

### Correlation - Statistics
#### **Pearson Correlation**
**Measures the strength of the correlation between two features**
  - Correlation coefficient
  - P-value <br>
**Correlation coefficient**
  - close to +1: Large positive relationship
  - close to -1: Large negative relationship
  - close to 0: No linear relationship <br>
**P-Value**
  - p-value < 0.001: strong certainty in the result
  - p-value < 0.05: moderate certainty in the result
  - p-value < 0.1: weak certainty in the result
  - p-value > 0.1: no certainty in the result <br>
**Strong Correlation**
  - Correlation coefficient close to 1 or -1
  - P-value less than 0.001 <br>

`pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])`
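
A runnable sketch of the Pearson test on hypothetical values. The column name `horsepower` and the numbers are assumptions made for illustration. The result is read against the thresholds listed above.

```python
import pandas as pd
from scipy import stats

# Hypothetical horsepower and price values (an assumption).
df = pd.DataFrame({
    "horsepower": [68, 88, 102, 115, 160, 184, 207, 262],
    "price": [6500, 8000, 9900, 13500, 18000, 25500, 34000, 36000],
})

# pearsonr returns the correlation coefficient and its two-sided p-value.
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])

print("Pearson coefficient:", pearson_coef)
print("P-value:", p_value)

# Apply the "strong correlation" rule from the notes.
is_strong = abs(pearson_coef) > 0.9 and p_value < 0.001
print("Strong correlation:", is_strong)
```

The 0.9 cutoff for "close to 1" is a choice made for this example; the notes do not fix a number.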

# **Chi-Square Test for Categorical Variables**
The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. This test is widely used in various fields, including social sciences, marketing, and healthcare, to analyze survey data, experimental results, and observational studies.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Create the contingency table
data = [[20, 30],  # Male: [Like, Dislike]
        [25, 25]]  # Female: [Like, Dislike]

# Create a DataFrame for clarity
df = pd.DataFrame(data, columns=["Like", "Dislike"], index=["Male", "Female"])

# Perform the Chi-Square Test
chi2, p, dof, expected = chi2_contingency(df)

# Display results
print("Chi-square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)
print("Expected Frequencies:\n", expected)
```

```
Chi-square Statistic: 0.6464646464646464
Degrees of Freedom: 1
P-value: 0.4213795037428696
Expected Frequencies:
 [[22.5 27.5]
 [22.5 27.5]]
```
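
To see where these numbers come from, the same test can be worked through by hand. Each expected frequency is the row total times the column total divided by the grand total. For a 2x2 table `chi2_contingency` applies Yates' continuity correction by default, which is why the statistic is about 0.646 rather than the uncorrected value of about 1.010. The variable names below are chosen for this sketch.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Same observed counts as above: rows Male/Female, columns Like/Dislike
observed = np.array([[20, 30],
                     [25, 25]])

# Expected count = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Uncorrected statistic: sum of (O - E)^2 / E
chi2_plain = ((observed - expected) ** 2 / expected).sum()

# Yates' correction: shrink each |O - E| by 0.5 before squaring
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Degrees of freedom = (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# P-value: upper tail of the chi-square distribution
p_value = chi2.sf(chi2_yates, dof)

print("Uncorrected statistic:", chi2_plain)
print("Yates-corrected statistic:", chi2_yates)
print("P-value:", p_value)
```

Since the p-value of about 0.42 is well above 0.05, this sample gives no evidence of an association between gender and preference.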
