# Complete Data Exploration Project

## Putting It All Together!

In this notebook, we'll combine **NumPy**, **Pandas**, and **Matplotlib** to perform a complete data analysis workflow.

### What You'll Learn:
- Build and explore a dataset (simulated weather data in this notebook)
- Calculate meaningful statistics
- Visualize patterns and trends
- Draw insights from data

This is what data scientists do every day!

## Import Libraries

First, let's import all the tools we need for data analysis:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Make plots look better
# (this style name needs Matplotlib 3.6+; older versions call it 'seaborn-darkgrid')
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("All libraries imported successfully!")

## The Project: Weather Data Analysis

### Scenario:
You're a meteorologist analyzing weather patterns for a city over 30 days.

### Goals:
- Explore **temperature**, **humidity**, and **rainfall** patterns
- Find correlations between weather variables
- Visualize trends to communicate findings
- Answer key questions about the weather

## Step 1: Load the Data

We don't have real measurements here, so let's simulate a weather dataset covering 30 days:

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Create 30 days of weather data
dates = pd.date_range(start='2024-01-01', periods=30, freq='D')

# Generate synthetic weather data (each variable is drawn independently of the others)
temperature = np.random.normal(loc=22, scale=5, size=30)  # Mean 22°C, standard deviation 5°C
humidity = np.random.uniform(40, 90, size=30)  # Uniform between 40% and 90% humidity
rainfall = np.random.exponential(scale=5, size=30)  # Rainfall in mm (exponential distribution, mean 5 mm)

# Create DataFrame
weather_df = pd.DataFrame({
    'Date': dates,
    'Temperature': temperature.round(1),
    'Humidity': humidity.round(1),
    'Rainfall': rainfall.round(1)
})

print("Weather data loaded successfully!")
print(f"\nDataset shape: {weather_df.shape}")
print(f"Date range: {weather_df['Date'].min().date()} to {weather_df['Date'].max().date()}")
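
With real measurements, the DataFrame would usually come from a file rather than a random number generator. The sketch below shows one way that could look with `pd.read_csv`. The CSV text and the name `sample_df` are made up for illustration, and `io.StringIO` stands in for a file path such as a hypothetical `weather.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """Date,Temperature,Humidity,Rainfall
2024-01-01,24.5,70.2,0.8
2024-01-02,21.3,48.5,12.4
2024-01-03,25.2,53.3,3.1
"""

# parse_dates turns the Date column into datetime64 values,
# the same type that pd.date_range produces
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(sample_df.dtypes)
print(f"Shape: {sample_df.shape}")
```

Once loaded this way, the rest of the workflow (`.head()`, `.info()`, `.describe()`) works the same as on the simulated data.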

In [None]:
# Display first few rows
print("First 5 days of weather data:")
weather_df.head()

In [None]:
# Get dataset information
print("Dataset Information:")
weather_df.info()

In [None]:
# Statistical summary
print("Statistical Summary:")
weather_df.describe()

## Step 2: Data Analysis

Now let's calculate key statistics and find patterns in the data:

In [None]:
# Calculate key statistics
avg_temp = weather_df['Temperature'].mean()
max_temp = weather_df['Temperature'].max()
min_temp = weather_df['Temperature'].min()

# Count rainy days (rainfall > 5mm)
rainy_days = (weather_df['Rainfall'] > 5).sum()

# Calculate total rainfall
total_rainfall = weather_df['Rainfall'].sum()

print("Weather Statistics:")
print(f"  Average Temperature: {avg_temp:.1f}°C")
print(f"  Maximum Temperature: {max_temp:.1f}°C")
print(f"  Minimum Temperature: {min_temp:.1f}°C")
print(f"  Rainy Days (>5mm): {rainy_days} days")
print(f"  Total Rainfall: {total_rainfall:.1f} mm")

In [None]:
# Find correlations between variables
print("\nCorrelation Analysis:")
correlations = weather_df[['Temperature', 'Humidity', 'Rainfall']].corr()
print(correlations)

# Highlight key correlation
humidity_rain_corr = correlations.loc['Humidity', 'Rainfall']
print(f"\nHumidity-Rainfall Correlation: {humidity_rain_corr:.3f}")
if humidity_rain_corr > 0.5:
    print("Strong positive correlation - higher humidity tends to coincide with more rain!")
elif humidity_rain_corr < -0.5:
    print("Strong negative correlation - interesting pattern!")
else:
    print("Weak correlation - no clear linear relationship between humidity and rainfall.")
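
To see what `.corr()` is computing, here is a minimal sketch that rebuilds the Pearson coefficient by hand and compares it with the pandas result. The series `x` and `y` are stand-ins generated just for this example (the seed is arbitrary). Like the humidity and rainfall columns in this notebook, they are drawn independently, so any correlation that shows up is sampling noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = pd.Series(rng.uniform(40, 90, size=30))   # stand-in for Humidity
y = pd.Series(rng.exponential(5, size=30))    # stand-in for Rainfall, independent of x

# Pearson r = covariance / (std of x * std of y)
# pandas uses sample estimates (ddof=1) for both cov() and std()
manual_r = x.cov(y) / (x.std() * y.std())
pandas_r = x.corr(y)

print(f"manual: {manual_r:.3f}, pandas: {pandas_r:.3f}")
```

With only 30 points, two unrelated variables can still show a noticeable nonzero coefficient, which is worth keeping in mind when reading the output above.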

In [None]:
# TODO: YOUR TURN!
# Find the hottest and coldest days
# Hint: Use idxmax() and idxmin() methods
# Then print the date and temperature for each

# Your code here:
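
If `idxmax()` and `idxmin()` are new to you, the sketch below shows how they behave on a tiny made-up table (`toy_df` is hypothetical, not the notebook's `weather_df`). Both methods return the index label of the extreme value, which can then be passed to `.loc` to fetch the whole row.

```python
import pandas as pd

# Hypothetical three-day sample, not the notebook's weather_df
toy_df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=3, freq='D'),
    'Temperature': [21.5, 27.0, 18.2],
})

hottest_label = toy_df['Temperature'].idxmax()  # index label of the largest value
coldest_label = toy_df['Temperature'].idxmin()  # index label of the smallest value

# .loc looks up the full row for each label
hottest = toy_df.loc[hottest_label]
coldest = toy_df.loc[coldest_label]

print(f"Hottest: {hottest['Date'].date()} at {hottest['Temperature']:.1f}°C")
print(f"Coldest: {coldest['Date'].date()} at {coldest['Temperature']:.1f}°C")
```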


## Step 3: Data Visualization

Let's create visualizations to understand the patterns better:

In [None]:
# Create a figure with 3 subplots
fig, axes = plt.subplots(3, 1, figsize=(12, 10))
fig.suptitle('30-Day Weather Analysis', fontsize=16, fontweight='bold')

# Plot 1: Temperature over time (Line plot)
axes[0].plot(weather_df['Date'], weather_df['Temperature'], 
             marker='o', color='orangered', linewidth=2, markersize=4)
axes[0].axhline(y=avg_temp, color='blue', linestyle='--', 
                label=f'Average: {avg_temp:.1f}°C')
axes[0].set_title('Temperature Trend Over Time', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Temperature (°C)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
plt.setp(axes[0].xaxis.get_majorticklabels(), rotation=45)

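# Plot 2: Average temperature by week (Bar chart)
# Weeks are 7-day blocks counted from the first row (relies on the default 0..29 index);
# week 5 holds only the last 2 days, so its average is based on fewer readings
weather_df['Week'] = (weather_df.index // 7) + 1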
weekly_temp = weather_df.groupby('Week')['Temperature'].mean()
axes[1].bar(weekly_temp.index, weekly_temp.values, 
            color=['steelblue', 'seagreen', 'coral', 'mediumpurple', 'gold'][:len(weekly_temp)],
            alpha=0.8, edgecolor='black')
axes[1].set_title('Average Temperature by Week', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Week Number')
axes[1].set_ylabel('Average Temperature (°C)')
axes[1].grid(True, alpha=0.3, axis='y')

# Plot 3: Humidity vs Rainfall (Scatter plot)
scatter = axes[2].scatter(weather_df['Humidity'], weather_df['Rainfall'],
                         c=weather_df['Temperature'], cmap='coolwarm',
                         s=100, alpha=0.6, edgecolor='black')
axes[2].set_title('Humidity vs Rainfall (colored by Temperature)', 
                  fontsize=12, fontweight='bold')
axes[2].set_xlabel('Humidity (%)')
axes[2].set_ylabel('Rainfall (mm)')
axes[2].grid(True, alpha=0.3)
cbar = plt.colorbar(scatter, ax=axes[2])
cbar.set_label('Temperature (°C)')

plt.tight_layout()
plt.show()

print("Visualizations created successfully!")
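
The week numbers used in the bar chart come from integer division of the row index by 7. The short sketch below reproduces that calculation on a plain 30-element index to show how many days land in each week. The names `day_index`, `week`, and `days_per_week` are introduced here only for illustration.

```python
import pandas as pd

day_index = pd.RangeIndex(30)          # same default index as a 30-row DataFrame
week = pd.Series(day_index // 7 + 1)   # rows 0-6 -> week 1, rows 7-13 -> week 2, ...

# Count how many days fall into each week number
days_per_week = week.value_counts().sort_index()
print(days_per_week)
```

Weeks 1 through 4 each contain 7 days, while week 5 contains only 2, so the last bar in the chart averages far fewer readings than the others.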

In [None]:
# TODO: YOUR TURN!
# Create a histogram of temperatures
# Show the distribution of temperature values
# Hint: Use plt.hist() with 10 bins
# Add title, labels, and grid

# Your code here:
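
Before plotting, it can help to see what a histogram actually counts. `plt.hist()` bins its input with `np.histogram`, so the sketch below calls that function directly and prints a rough text version of the result. The `temps` array is a stand-in generated for this example with an arbitrary seed, not the notebook's Temperature column.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for this sketch
temps = rng.normal(loc=22, scale=5, size=30)  # stand-in for the Temperature column

# 10 equal-width bins spanning the minimum to the maximum value
counts, edges = np.histogram(temps, bins=10)

# One text row per bin: the bin range followed by one '#' per observation
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} °C: {'#' * int(n)}")
```

Ten bins always produce eleven edges, and the counts add up to the number of observations.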


## Step 4: Answer Key Questions

Let's use our data to answer important weather-related questions:

In [None]:
# Question 1: What's the average temperature?
print("Question 1: What's the average temperature?")
avg_temp = weather_df['Temperature'].mean()
print(f"Answer: The average temperature over 30 days is {avg_temp:.1f}°C")
print()

In [None]:
# Question 2: How many rainy days did we have?
print("Question 2: How many rainy days (>5mm rainfall) did we have?")
rainy_days = (weather_df['Rainfall'] > 5).sum()
percentage = (rainy_days / len(weather_df)) * 100
print(f"Answer: {rainy_days} rainy days ({percentage:.1f}% of the month)")
print()

In [None]:
# Question 3: Is there a correlation between humidity and rainfall?
print("Question 3: Is there a correlation between humidity and rainfall?")
correlation = weather_df['Humidity'].corr(weather_df['Rainfall'])
print(f"Answer: Correlation coefficient = {correlation:.3f}")

if abs(correlation) > 0.7:
    strength = "strong"
elif abs(correlation) > 0.4:
    strength = "moderate"
else:
    strength = "weak"

direction = "positive" if correlation > 0 else "negative"
print(f"There is a {strength} {direction} correlation.")
if correlation > 0.4:
    print("Higher humidity tends to be associated with more rainfall.")
print()

In [None]:
# TODO: YOUR TURN!
# Question 4: Find the week with the most rainfall
# Hint: Group by 'Week' column and sum the rainfall
# Then use idxmax() to find which week had the most
# Print the week number and total rainfall

# Your code here:
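
The group-then-aggregate pattern in the hint can be sketched on a small made-up table first (`toy_df` and its values are hypothetical). `groupby('Week')` collects rows by week, `.sum()` totals the rainfall within each group, and `idxmax()` returns the week label with the largest total.

```python
import pandas as pd

# Hypothetical values, not the notebook's weather_df
toy_df = pd.DataFrame({
    'Week': [1, 1, 2, 2, 3],
    'Rainfall': [2.0, 3.5, 10.0, 0.5, 4.0],
})

weekly_total = toy_df.groupby('Week')['Rainfall'].sum()  # one total per week
wettest_week = weekly_total.idxmax()                     # week label with the largest total

print(f"Wettest week: {wettest_week} with {weekly_total.loc[wettest_week]:.1f} mm")
```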


## Summary: Complete Data Analysis Workflow

Congratulations! You've just worked through a complete data analysis workflow on simulated data!

### What We Did:

1. **Data Loading**
   - Created a simulated weather dataset
   - Explored the data structure with `.info()` and `.describe()`

2. **Data Analysis**
   - Calculated statistical measures (mean, max, min)
   - Checked for relationships between variables using correlation analysis
   - Identified significant events (rainy days, temperature extremes)

3. **Data Visualization**
   - Line plots for trends over time
   - Bar charts for categorical comparisons
   - Scatter plots for relationships between variables
   - Histograms for distributions

4. **Insights & Communication**
   - Answered specific questions with data
   - Drew meaningful conclusions
   - Presented findings clearly

### Real-World Applications:

The same workflow carries over to many kinds of data analysis projects:
- **Business**: Sales trends, customer behavior, inventory management
- **Science**: Experimental data, climate research, medical studies
- **Finance**: Stock prices, economic indicators, risk analysis
- **Sports**: Player statistics, game outcomes, performance tracking
- **Social Media**: User engagement, content trends, sentiment analysis

### Key Skills Practiced:
- NumPy for numerical computations
- Pandas for data manipulation and analysis
- Matplotlib for data visualization
- Statistical thinking and correlation analysis
- Professional data presentation

### Next Steps:
- Try analyzing your own datasets (CSV files, Excel, etc.)
- Explore more advanced visualizations (seaborn library)
- Learn machine learning to make predictions from data
- Build interactive dashboards with Plotly or Streamlit

**You now have the foundation to work with real-world data!**