## Chi-square analysis (Correlation)

### 1. Import Libraries

In [2]:
import pandas as pd
from scipy import stats

### 2. Categorical Data Example:
- Imagine we have data on customer product satisfaction (Satisfied, Neutral, Dissatisfied) and customer age group (20-30, 31-40, 41-50, 51-60, 61-70).

In [9]:
# Sample data (categorical)
data = {'Satisfaction': ['Satisfied', 'Neutral', 'Dissatisfied', 'Satisfied', 'Dissatisfied'],
        'Age Group': ['20-30', '31-40', '41-50', '51-60', '61-70']}

# Create pandas DataFrame
df = pd.DataFrame(data)
df

   Satisfaction Age Group
0     Satisfied     20-30
1       Neutral     31-40
2  Dissatisfied     41-50
3     Satisfied     51-60
4  Dissatisfied     61-70


### 3. Contingency Table:
- We can create a contingency table to see the distribution of satisfaction across different age groups.

In [10]:
contingency_table = pd.crosstab(df['Satisfaction'], df['Age Group'])
print(contingency_table)

Age Group     20-30  31-40  41-50  51-60  61-70
Satisfaction                                   
Dissatisfied      0      0      1      0      1
Neutral           0      1      0      0      0
Satisfied         1      0      0      1      0


The table shows the number of customers in each satisfaction category for each age group. With only five customers, every age group contains exactly one observation.
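The row and column totals of this table are what the chi-square test later uses to work out expected counts. As a small sketch, `pd.crosstab` can append those totals with `margins=True`. The name `table_with_totals` is only illustrative and is not part of the original notebook.

```python
import pandas as pd

# Same sample data as in the cell above
data = {'Satisfaction': ['Satisfied', 'Neutral', 'Dissatisfied', 'Satisfied', 'Dissatisfied'],
        'Age Group': ['20-30', '31-40', '41-50', '51-60', '61-70']}
df = pd.DataFrame(data)

# margins=True appends an 'All' row and an 'All' column holding the totals
table_with_totals = pd.crosstab(df['Satisfaction'], df['Age Group'], margins=True)
print(table_with_totals)
```

The `All` column gives the number of customers per satisfaction level, and the `All` row gives the number per age group.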

### 4. Chi-Square Test (using SciPy):
- Calculating the chi-square statistic from scratch is tedious, so we use scipy.stats.chi2_contingency for convenience.

In [13]:
# Perform chi-square test
# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies)
chi2_statistic, p_value, dof, expected_frequency = stats.chi2_contingency(contingency_table.values)

# Print results
print("Chi-Square Statistic:", chi2_statistic)
print("p-value:", p_value)


Chi-Square Statistic: 10.000000000000004
p-value: 0.2650259152973615
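To make the statistic less of a black box, here is a minimal sketch of the same calculation done by hand with NumPy. It assumes the contingency table printed earlier. The names `observed`, `expected`, `chi2_manual`, and `dof_manual` are illustrative and do not come from the original notebook.

```python
import numpy as np

# Observed counts copied from the contingency table
# (rows: Dissatisfied, Neutral, Satisfied; columns: the five age groups)
observed = np.array([[0, 0, 1, 0, 1],
                     [0, 1, 0, 0, 0],
                     [1, 0, 0, 1, 0]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 5)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print("Chi-square (manual):", chi2_manual)
print("Degrees of freedom:", dof_manual)
```

The manual value agrees with the SciPy result because `chi2_contingency` applies its continuity correction only when there is a single degree of freedom, which is not the case for a 3 x 5 table.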


### Chi-Square Test Results:

- The chi-square test statistic is 10.0000, and the p-value is 0.2650.

  - Chi-Square Statistic (10.0000): The statistic measures how far the observed counts are from the counts expected if satisfaction and age group were independent. Its size only has meaning relative to the degrees of freedom, which for this 3 x 5 table are (3 - 1) x (5 - 1) = 8, so a value of 10 is not unusually large.
  - p-value (0.2650): If satisfaction and age group were independent, a statistic at least this large would occur about 26.5% of the time. This is above the commonly used significance level of 0.05, so the observed deviation from the expected frequencies is consistent with random chance.
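The link between the statistic and the p-value can be sketched directly with the chi-square distribution. The example below assumes 8 degrees of freedom (a 3 x 5 table) and the conventional 0.05 significance level. The variable names are illustrative.

```python
from scipy import stats

chi2_statistic = 10.0   # value reported by chi2_contingency above
dof = 8                 # (3 - 1) * (5 - 1) for a 3 x 5 table
alpha = 0.05            # conventional significance level (an assumption)

# p-value: probability of a statistic at least this large under independence
p_value = stats.chi2.sf(chi2_statistic, dof)

# Critical value: the statistic would have to exceed this to reject at alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)

print("p-value:", p_value)
print("Critical value at alpha = 0.05:", critical_value)
print("Reject null hypothesis:", chi2_statistic > critical_value)
```

Comparing the statistic with the critical value and comparing the p-value with `alpha` are two views of the same decision, so they always agree.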

#### Conclusion:

- Based on these results, there is not enough evidence to reject the null hypothesis that satisfaction level and age group are independent. In other words, the sample data show no statistically significant association between the two.

#### Important Considerations:

- This analysis is based on a very small sample (5 data points). Chi-square tests are unreliable with small samples.
- The chi-square approximation depends on the expected cell counts. A common rule of thumb asks for expected counts of at least 5, and here every expected count is 0.4 or less, so the p-value above should be treated as illustrative only.
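One way to check the expected counts before trusting the test is sketched below. It uses `scipy.stats.contingency.expected_freq` on the same table. The threshold of 5 is the usual rule of thumb rather than a hard requirement, and the names `min_expected` and `share_below_threshold` are illustrative.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts copied from the contingency table
observed = np.array([[0, 0, 1, 0, 1],
                     [0, 1, 0, 0, 0],
                     [1, 0, 0, 1, 0]])

# Expected counts under independence of rows and columns
expected = expected_freq(observed)

min_expected = 5  # rule-of-thumb threshold (an assumption, not a strict rule)
share_below_threshold = (expected < min_expected).mean()

print("Expected counts:")
print(expected)
print("Largest expected count:", expected.max())
print("Share of cells below threshold:", share_below_threshold)
```

If most cells fall below the threshold, as they do here, collecting more data or merging sparse categories is usually a better next step than interpreting the p-value.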