## 1. Pearson Correlation Coefficient

**Simple Explanation:** It measures the strength and direction of the linear relationship between two variables. The value ranges from -1 to 1. A value close to 1 means a strong positive relationship; close to -1 means a strong negative relationship; close to 0 means little or no linear relationship (the variables may still be related in a nonlinear way).

**Real-world analogy:** Imagine you're observing the relationship between the amount of time students study and their test scores. If students who study more tend to score higher, there's a positive correlation. If students who study more tend to score lower (which is unlikely), there's a negative correlation.
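
To make the definition concrete, here is a minimal sketch that computes the coefficient by hand: the sum of co-deviations from the means, divided by the product of the two deviation norms. The `pearson` helper is an illustrative assumption, not a library function; the data is the same study sample used in the pandas example in the next section.

```python
import math

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: how the two variables move together around their means
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: rescales the result into the range [-1, 1]
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_dev / (norm_x * norm_y)

# Hours studied and test scores for five students
hours = [1, 2, 3, 4, 5]
scores = [50, 60, 65, 70, 90]

r = pearson(hours, scores)
print(f"r = {r:.4f}")  # r = 0.9594
```

Negating one of the variables flips the sign of `r` but not its magnitude, which is why the coefficient captures direction as well as strength.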


## 2. Pandas

### `corr()`

**Simple Explanation:** `corr()` is a DataFrame method that calculates the pairwise correlation between columns. It uses the Pearson coefficient by default.

**Example with Sample Data:**

```python
import pandas as pd

# Sample data: hours studied vs. test scores of students
data = {
    'Hours_Studied': [1, 2, 3, 4, 5],
    'Test_Score': [50, 60, 65, 70, 90]
}
df = pd.DataFrame(data)
df  # display the DataFrame
```


| | Hours_Studied | Test_Score |
|---|---|---|
| 0 | 1 | 50 |
| 1 | 2 | 60 |
| 2 | 3 | 65 |
| 3 | 4 | 70 |
| 4 | 5 | 90 |


```python
# Calculate correlation using pandas
correlation_matrix = df.corr()

correlation_matrix
```

| | Hours_Studied | Test_Score |
|---|---|---|
| Hours_Studied | 1.0 | 0.959403 |
| Test_Score | 0.959403 | 1.0 |
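
`DataFrame.corr()` also accepts a `method` parameter (`'pearson'`, `'spearman'`, or `'kendall'`) and a `numeric_only` flag. The sketch below assumes pandas 1.5 or later and adds a hypothetical `Student` text column purely to show why `numeric_only=True` can be needed when a DataFrame mixes text and numbers.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Student': ['A', 'B', 'C', 'D', 'E'],  # hypothetical non-numeric column
    'Hours_Studied': [1, 2, 3, 4, 5],
    'Test_Score': [50, 60, 65, 70, 90],
})

# numeric_only=True skips the text column instead of trying to correlate it
pearson_matrix = df_demo.corr(numeric_only=True)

# Rank-based alternative: Spearman correlation on the same columns
spearman_matrix = df_demo.corr(method='spearman', numeric_only=True)

print(pearson_matrix)
print(spearman_matrix)
```

The diagonal is always 1.0 because every column is perfectly correlated with itself, and the matrix is symmetric because the correlation of A with B equals the correlation of B with A.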


## 3. scipy

### `pearsonr()`, `spearmanr()`, `kendalltau()`

**Simple Explanation:** These are functions from the `scipy` library that help us calculate different types of correlations.

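- **pearsonr()**: Calculates the Pearson correlation coefficient (discussed above), which measures linear relationships.
- **spearmanr()**: Calculates Spearman's rank correlation, which measures the strength and direction of a monotonic relationship by correlating the ranks of the values instead of the values themselves.
- **kendalltau()**: Calculates Kendall's tau, another rank-based measure, based on how many pairs of observations are ordered the same way in both variables.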

**Real-world analogy:** Imagine a race. Pearson works with the exact finishing times, so it is sensitive to how far apart the racers are. Spearman and Kendall look only at the order in which racers finish (their rankings), not at the exact times.
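
To see the difference between value-based and rank-based measures, the sketch below uses made-up data where `y` grows as the cube of `x`: perfectly ordered, but not a straight line. It assumes there are no tied values, so a simple sort gives the ranks. The helpers `pearson`, `ranks`, and `kendall_tau` are illustrative, not the scipy implementations.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson r computed directly from its definition."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_dev / (norm_x * norm_y)

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def kendall_tau(x, y):
    """(concordant - discordant) / total pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # pair ordered the same way in both variables
        elif direction < 0:
            discordant += 1   # pair ordered in opposite ways
    return (concordant - discordant) / (concordant + discordant)

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but nonlinear

pearson_r = pearson(x, y)                 # below 1: points are not on a line
spearman_r = pearson(ranks(x), ranks(y))  # Spearman = Pearson on the ranks
kendall_t = kendall_tau(x, y)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
print(f"Kendall:  {kendall_t:.3f}")
```

Both rank-based measures report a perfect relationship because the ordering of `y` matches the ordering of `x` exactly, while Pearson falls short of 1 because the points do not lie on a straight line.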

**Example with Sample Data:**

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Using the same data from above
hours = df['Hours_Studied']
scores = df['Test_Score']

# Calculate correlations using scipy.
# Each function returns (coefficient, p-value); the p-value is discarded here.
pearson_corr, _ = pearsonr(hours, scores)
spearman_corr, _ = spearmanr(hours, scores)
kendall_corr, _ = kendalltau(hours, scores)

print(f"Pearson Correlation: {pearson_corr}")
print(f"Spearman Correlation: {spearman_corr}")
print(f"Kendall Correlation: {kendall_corr}")
```


```
Pearson Correlation: 0.9594032236002469
Spearman Correlation: 0.9999999999999999
Kendall Correlation: 0.9999999999999999
```
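
Each of these functions returns two values: the coefficient and a p-value, which the example above discards with `_`. The following sketch keeps the p-value for the Pearson test on the same five data points. The 0.05 threshold is a common convention, not a rule, and with so few observations the result should be read with caution.

```python
from scipy.stats import pearsonr

hours = [1, 2, 3, 4, 5]
scores = [50, 60, 65, 70, 90]

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = pearsonr(hours, scores)

print(f"r = {r:.4f}, p-value = {p_value:.4f}")

# Conventional (but arbitrary) significance threshold
if p_value < 0.05:
    print("A correlation this strong is unlikely to arise by chance alone")
else:
    print("Not enough evidence of a linear relationship")
```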


---

### Dataset: Sales Data of a Retail Store

Imagine a retail store that sells electronics. They've collected data over several months, capturing:

- Date of sale
- Product type (e.g., TV, Laptop, Mobile, etc.)
- Sale price
- Customer age
- Customer gender
- Whether the customer is a repeat customer
- Rating given by the customer for the product

Here's a snippet of the dataset:

```python
# Sample sales data
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03'],
    'Product': ['TV', 'Laptop', 'Mobile', 'TV'],
    'Sale_Price': [500, 1000, 800, 550],
    'Customer_Age': [25, 30, 22, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Repeat_Customer': [True, False, True, False],
    'Rating': [4.5, 4.0, 5.0, 3.5]
}
```

### Simple Explanations with Real-world Examples

1. **Correlation between Sale Price and Customer Age**:
   - **Real-world Scenario:** Do older customers tend to buy more expensive items?
   - **Code:** `correlation = df['Sale_Price'].corr(df['Customer_Age'])`
   - If the correlation is positive and close to 1, it means older customers tend to buy pricier items.

2. **Average Rating by Product**:
   - **Real-world Scenario:** Which product has the highest satisfaction among customers?
   - **Code:** `avg_rating = df.groupby('Product')['Rating'].mean()`
   - This will show the average rating for each product. Higher ratings indicate better customer satisfaction.

3. **Sales by Gender**:
   - **Real-world Scenario:** Do male or female customers spend more on electronics in this store?
   - **Code:** `sales_by_gender = df.groupby('Gender')['Sale_Price'].sum()`
   - This will show total sales for male and female customers. Comparing the values will give an insight into who spends more.

4. **Percentage of Repeat Customers**:
   - **Real-world Scenario:** How loyal are the store's customers?
   - **Code:** `repeat_percentage = df['Repeat_Customer'].mean() * 100`
   - Each row is one sale, so this is the percentage of sales made by repeat customers. A higher percentage indicates that many customers come back to buy again, showing good customer loyalty.

By breaking down the complex dataset into smaller questions and using simple real-world scenarios, we can make the data more understandable and relatable.

```python
import pandas as pd

# Sample sales data
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03'],
    'Product': ['TV', 'Laptop', 'Mobile', 'TV'],
    'Sale_Price': [500, 1000, 800, 550],
    'Customer_Age': [25, 30, 22, 28],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Repeat_Customer': [True, False, True, False],
    'Rating': [4.5, 4.0, 5.0, 3.5]
}
df = pd.DataFrame(data)

# 1. Correlation between Sale Price and Customer Age
correlation = df['Sale_Price'].corr(df['Customer_Age'])
print(f"Correlation between Sale Price and Customer Age: {correlation:.2f}")

# 2. Average Rating by Product
avg_rating = df.groupby('Product')['Rating'].mean()
print("\nAverage Rating by Product:")
print(avg_rating)

# 3. Sales by Gender
sales_by_gender = df.groupby('Gender')['Sale_Price'].sum()
print("\nSales by Gender:")
print(sales_by_gender)

# 4. Percentage of Repeat Customers
repeat_percentage = df['Repeat_Customer'].mean() * 100
print(f"\nPercentage of Repeat Customers: {repeat_percentage:.2f}%")
```


```
Correlation between Sale Price and Customer Age: 0.28

Average Rating by Product:
Product
Laptop    4.0
Mobile    5.0
TV        4.0
Name: Rating, dtype: float64

Sales by Gender:
Gender
Female    1550
Male      1300
Name: Sale_Price, dtype: int64

Percentage of Repeat Customers: 50.00%
```


Based on the sample data and the code provided, here's a conclusion for each scenario:

1. **Correlation between Sale Price and Customer Age**:
   - The correlation is 0.28, a weak positive relationship: older customers in this sample paid slightly more, but with only four sales the value is too unstable to generalize.
     - A value close to 1 would suggest that older customers tend to buy more expensive items.
     - A value close to -1 would suggest that younger customers tend to buy pricier items.
     - A value close to 0 would suggest no strong linear relationship between age and the price of the product purchased.

2. **Average Rating by Product**:
   - Mobile has the highest average rating (5.0), while Laptop and TV both average 4.0. A higher average rating indicates better customer satisfaction, though here Laptop and Mobile each rest on a single rating.

3. **Sales by Gender**:
   - Female customers account for 1550 in total sales versus 1300 for male customers. Both groups made two purchases, so the gap comes from higher-priced purchases, not more frequent ones. A total on its own cannot separate the two.

4. **Percentage of Repeat Customers**:
   - Half of the sales (50%) were made by repeat customers. A high percentage indicates that many customers return to the store to make additional purchases, suggesting good customer loyalty and satisfaction with the store's products or services.
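
Some of these conclusions can be read off programmatically instead of by eye. A minimal sketch, rebuilding only the sample columns it needs; the names `top_product`, `sales_share`, and `purchases` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['TV', 'Laptop', 'Mobile', 'TV'],
    'Sale_Price': [500, 1000, 800, 550],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [4.5, 4.0, 5.0, 3.5],
})

# Product with the highest average rating
avg_rating = df.groupby('Product')['Rating'].mean()
top_product = avg_rating.idxmax()

# Each gender's share of total sales, as a percentage
sales_by_gender = df.groupby('Gender')['Sale_Price'].sum()
sales_share = sales_by_gender / sales_by_gender.sum() * 100

# Number of purchases per gender, to separate "spends more" from "buys more often"
purchases = df.groupby('Gender')['Sale_Price'].count()

print(f"Top-rated product: {top_product}")
print(sales_share.round(1))
print(purchases)
```

Reporting the purchase count next to the sales total guards against reading a difference in spending as a difference in buying frequency.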

**Overall Conclusion**:
From the sample data, we can gather insights into customer preferences, behaviors, and satisfaction levels. These insights can be used by the store to make informed decisions, such as which products to promote, how to target marketing campaigns, or areas to improve for better customer satisfaction.