# Week 1: Environment Setup & Introduction to Data Science

## Learning Objectives
By the end of this notebook, you will be able to:
- Explain what data science is and where it is applied
- Navigate Jupyter Notebooks effectively
- Use basic Python for data science
- Work with NumPy arrays and Pandas DataFrames
- Create simple visualizations

---

## 1. What is Data Science?

Data Science is an interdisciplinary field that combines:
- **Statistics & Mathematics**: For understanding patterns and relationships
- **Computer Science**: For processing and analyzing large datasets
- **Domain Expertise**: For asking the right questions and interpreting results

### Common Applications:
- **Business**: Customer segmentation, fraud detection, recommendation systems
- **Healthcare**: Drug discovery, medical diagnosis, epidemic modeling
- **Technology**: Search engines, social media algorithms, autonomous vehicles
- **Science**: Climate modeling, genomics, astronomy

### The Data Science Process:
1. **Ask Questions**: Define the problem
2. **Get Data**: Collect relevant data
3. **Explore Data**: Understand what you're working with
4. **Model Data**: Apply statistical/ML techniques
5. **Communicate Results**: Share insights and recommendations

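The five steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the question, the synthetic data, and the names `quiz_df`, `hours`, and `scores` are made up here, and a straight-line fit with `np.polyfit` stands in for the modeling step.

```python
import numpy as np
import pandas as pd

# 1. Ask a question: do hours of practice relate to quiz scores?
# 2. Get data: synthetic values, generated here purely for illustration.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)
scores = 50 + 4 * hours + rng.normal(0, 5, size=50)
quiz_df = pd.DataFrame({"hours": hours, "score": scores})

# 3. Explore: summary statistics for each column.
summary = quiz_df.describe()

# 4. Model: fit a straight line, score = slope * hours + intercept.
slope, intercept = np.polyfit(quiz_df["hours"], quiz_df["score"], deg=1)

# 5. Communicate: report the finding in plain language.
print(summary.loc[["mean", "min", "max"]])
print(f"Each extra hour is associated with about {slope:.1f} more points.")
```

Real projects rarely move through the steps this cleanly; you will often loop back from exploring or modeling to collecting more data.
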
## 2. Python Environment Check

Let's make sure your environment is set up correctly!

In [None]:
# Check Python version
import sys
print(f"Python version: {sys.version}")

# Import essential libraries
try:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    print("✅ All essential libraries imported successfully!")
except ImportError as e:
    print(f"❌ Error importing libraries: {e}")
    print("Install missing libraries from a notebook cell with: %pip install numpy pandas matplotlib seaborn")

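If the imports succeed, it can also help to know which versions you have, since behavior sometimes differs between releases. The sketch below uses the standard library's `importlib.metadata`. The helper name `installed_versions` is hypothetical, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["numpy", "pandas", "matplotlib", "seaborn"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```
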
## 3. Jupyter Notebook Basics

### Cell Types:
- **Code cells**: Execute Python code (like the cell below)
- **Markdown cells**: Display formatted text (like this explanation)

### Keyboard Shortcuts:
- `Shift + Enter`: Run cell and move to next
- `Ctrl + Enter`: Run cell and stay in current cell

The following shortcuts work in command mode (press `Esc` first):
- `A`: Insert cell above
- `B`: Insert cell below
- `D`, `D` (press `D` twice): Delete cell
- `M`: Convert to Markdown
- `Y`: Convert to Code

In [None]:
# Try running this cell!
print("Hello, Data Science World!")
print("This is my first data science notebook.")

## 4. Python Fundamentals Review

Let's quickly review essential Python concepts for data science.

In [None]:
# Variables and basic data types
name = "Alice"  # string
age = 25        # integer
height = 5.6    # float
is_student = True  # boolean

print(f"Name: {name}, Age: {age}, Height: {height}, Student: {is_student}")

In [None]:
# Lists - ordered, mutable collections
numbers = [1, 2, 3, 4, 5]
fruits = ['apple', 'banana', 'orange']
mixed = [1, 'hello', 3.14, True]

print(f"Numbers: {numbers}")
print(f"First fruit: {fruits[0]}")
print(f"Last number: {numbers[-1]}")

In [None]:
# Dictionaries - key-value pairs
student = {
    'name': 'Bob',
    'age': 22,
    'grades': [85, 90, 78, 92]
}

print(f"Student name: {student['name']}")
print(f"Average grade: {sum(student['grades']) / len(student['grades'])}")

In [None]:
# Loops and conditionals
scores = [85, 92, 78, 96, 88]

for score in scores:
    if score >= 90:
        grade = 'A'
    elif score >= 80:
        grade = 'B'
    elif score >= 70:
        grade = 'C'
    else:
        grade = 'F'
    
    print(f"Score: {score} -> Grade: {grade}")

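The grading logic in the loop above can also be wrapped in a function so it is easy to reuse. This is one possible sketch: the function name `letter_grade` is an assumption made for illustration, and the cutoffs are the same ones used in the loop.

```python
def letter_grade(score):
    """Return the letter grade for a numeric score, using the cutoffs above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "F"


scores = [85, 92, 78, 96, 88]

# A list comprehension applies the function to every score.
grades = [letter_grade(s) for s in scores]
print(grades)
```
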
## 5. Introduction to NumPy

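NumPy (Numerical Python) is the foundation of most numerical computing in Python, and Pandas and many other data science libraries build on it. It provides:
- Efficient arrays (typically much faster than Python lists for element-wise numerical work)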
- Mathematical functions
- Linear algebra operations

In [None]:
import numpy as np

# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(f"1D Array: {arr1}")
print(f"2D Array:\n{arr2}")
print(f"Shape of arr2: {arr2.shape}")

In [None]:
# Array operations
numbers = np.array([1, 2, 3, 4, 5])

print(f"Original: {numbers}")
print(f"Add 10: {numbers + 10}")
print(f"Multiply by 2: {numbers * 2}")
print(f"Square: {numbers ** 2}")
print(f"Sum: {np.sum(numbers)}")
print(f"Mean: {np.mean(numbers)}")
print(f"Standard deviation: {np.std(numbers)}")  # population standard deviation (ddof=0)

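One detail worth knowing about `np.std`: by default it computes the population standard deviation (dividing by N), while Pandas methods such as `Series.std()` and `describe()` default to the sample standard deviation (dividing by N - 1). The short sketch below shows both on the same numbers. The variable names are just for illustration.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

population_std = np.std(values)      # ddof=0: divide by N
sample_std = np.std(values, ddof=1)  # ddof=1: divide by N - 1

print(f"Population std: {population_std:.4f}")
print(f"Sample std:     {sample_std:.4f}")
```
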
In [None]:
# Useful array creation functions
zeros = np.zeros(5)
ones = np.ones((3, 3))
range_arr = np.arange(0, 10, 2)
linspace_arr = np.linspace(0, 1, 5)
random_arr = np.random.random(5)

print(f"Zeros: {zeros}")
print(f"Ones:\n{ones}")
print(f"Range: {range_arr}")
print(f"Linspace: {linspace_arr}")
print(f"Random: {random_arr}")

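The speed claim at the start of this section is easy to check yourself. The sketch below times squaring 100,000 integers with a list comprehension and with a NumPy array, using the standard library's `timeit`. The function names are made up for this example, and the exact timings will vary by machine, so treat the result as a rough comparison rather than a benchmark.

```python
import timeit

import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)  # int64 so the squares cannot overflow


def square_with_list():
    return [x * x for x in py_list]


def square_with_numpy():
    return np_array * np_array


# Run each version 20 times and report the total time.
list_time = timeit.timeit(square_with_list, number=20)
numpy_time = timeit.timeit(square_with_numpy, number=20)

print(f"List comprehension: {list_time:.4f} s")
print(f"NumPy array:        {numpy_time:.4f} s")
```
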
## 6. Introduction to Pandas

Pandas is built on top of NumPy and provides:
- DataFrames (labeled, table-like data structures, similar to spreadsheets)
- Data cleaning and manipulation tools
- File I/O (CSV, Excel, JSON, etc.)

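To give a feel for the file I/O mentioned above, here is a minimal sketch of a CSV round trip. It writes to an in-memory `io.StringIO` buffer so that it runs without touching your disk. With a real file you would pass a path instead. The `people` DataFrame is invented for this example.

```python
import io

import pandas as pd

people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write CSV text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
people.to_csv(buffer, index=False)
print(buffer.getvalue())

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```
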
In [None]:
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
    'Salary': [70000, 80000, 90000, 75000, 85000]
}

df = pd.DataFrame(data)
print("Our first DataFrame:")
print(df)

In [None]:
# Basic DataFrame operations
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst 3 rows:")
print(df.head(3))
print("\nBasic statistics:")
print(df.describe())

In [None]:
# Selecting data
print("Names column:")
print(df['Name'])

print("\nAge and Salary columns:")
print(df[['Age', 'Salary']])

print("\nPeople older than 30:")
print(df[df['Age'] > 30])

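Filters like the one above can be combined. The sketch below rebuilds a small copy of the data (named `people` here so it does not overwrite `df`) and selects rows that satisfy two conditions at once, keeping only some columns with `.loc`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 35, 28, 32],
    "Salary": [70000, 80000, 90000, 75000, 85000],
})

# Each condition needs its own parentheses; & means "and", | means "or".
mask = (people["Age"] > 28) & (people["Salary"] < 90000)

# .loc selects rows by the boolean mask and columns by label in one step.
selected = people.loc[mask, ["Name", "Salary"]]
print(selected)
```
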
In [None]:
# Adding new columns
df['Salary_K'] = df['Salary'] / 1000  # Salary in thousands
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Mature')

print("DataFrame with new columns:")
print(df)

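The `apply` call above runs a Python function once per row, which is fine for small data. For larger DataFrames, a vectorized alternative such as `np.where` is usually faster. The sketch below compares the two on a small frame (named `ages` for illustration) and checks that they agree.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"Age": [25, 30, 35, 28, 32]})

# Row-by-row version, as in the cell above.
ages["Group_apply"] = ages["Age"].apply(lambda x: "Young" if x < 30 else "Mature")

# Vectorized version: the condition is evaluated on the whole column at once.
ages["Group_where"] = np.where(ages["Age"] < 30, "Young", "Mature")

print(ages)
print((ages["Group_apply"] == ages["Group_where"]).all())
```
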
## 7. Basic Data Visualization

Visualization is crucial for understanding data. Let's create some simple plots!

In [None]:
import matplotlib.pyplot as plt

# Set up the plotting style
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)

# Simple line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 4))
plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

In [None]:
# Bar plot using our DataFrame
plt.figure(figsize=(10, 6))
plt.bar(df['Name'], df['Age'], color='skyblue', edgecolor='navy', alpha=0.7)
plt.title('Age by Person')
plt.xlabel('Name')
plt.ylabel('Age')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'], color='red', alpha=0.7, s=100)
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary ($)')
plt.grid(True, alpha=0.3)

# Add labels for each point
for i, name in enumerate(df['Name']):
    plt.annotate(name, (df['Age'].iloc[i], df['Salary'].iloc[i]), 
                xytext=(5, 5), textcoords='offset points')

plt.tight_layout()
plt.show()

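The plots above use the `plt.` functions directly. Matplotlib also has an object-oriented style, where you draw on an explicit `Axes` object, and this is the style used with `plt.subplots` later in this notebook. The sketch below draws two curves that way and saves the figure as PNG bytes in memory. The buffer name `png_buffer` is an assumption for this example, and you could pass a filename to `savefig` instead.

```python
import io

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# plt.subplots returns the Figure and an Axes object to draw on.
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
ax.set_title("Sine and Cosine")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Save the figure as PNG bytes; pass a filename instead to write a file.
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=100)
print(f"PNG size: {len(png_buffer.getvalue())} bytes")
```

In a notebook the figure is typically displayed automatically at the end of the cell.
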
## 8. Your First Data Analysis

Let's put it all together with a simple analysis!

In [None]:
# Create a larger synthetic dataset
# (each column is drawn independently, so no relationship between columns is built in)
np.random.seed(42)  # For reproducible results

n_students = 100
students_data = {
    'Student_ID': range(1, n_students + 1),
    'Math_Score': np.random.normal(75, 15, n_students),
    'Science_Score': np.random.normal(80, 12, n_students),
    'Hours_Studied': np.random.normal(5, 2, n_students)
}

# Ensure realistic ranges
students_data['Math_Score'] = np.clip(students_data['Math_Score'], 0, 100)
students_data['Science_Score'] = np.clip(students_data['Science_Score'], 0, 100)
students_data['Hours_Studied'] = np.clip(students_data['Hours_Studied'], 0, 12)

students_df = pd.DataFrame(students_data)
print("Student dataset:")
print(students_df.head())
print(f"\nDataset shape: {students_df.shape}")

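The cell above uses `np.random.seed`, which sets NumPy's global random state. Newer NumPy code often uses a `Generator` created with `np.random.default_rng` instead. The sketch below shows that approach under the same idea of a fixed seed. Note that it produces different numbers from the seeded cell above, because the two use different algorithms.

```python
import numpy as np

# A Generator carries its own state, so seeding it does not affect other code.
rng = np.random.default_rng(42)
math_scores = np.clip(rng.normal(75, 15, size=100), 0, 100)

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(42)
math_scores_again = np.clip(rng_again.normal(75, 15, size=100), 0, 100)

print(np.array_equal(math_scores, math_scores_again))
```
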
In [None]:
# Basic statistics
print("Summary statistics:")
print(students_df.describe())

print(f"\nAverage Math Score: {students_df['Math_Score'].mean():.2f}")
print(f"Average Science Score: {students_df['Science_Score'].mean():.2f}")
print(f"Average Hours Studied: {students_df['Hours_Studied'].mean():.2f}")

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram of Math scores
axes[0, 0].hist(students_df['Math_Score'], bins=20, color='lightblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Math Scores')
axes[0, 0].set_xlabel('Math Score')
axes[0, 0].set_ylabel('Frequency')

# Histogram of Science scores
axes[0, 1].hist(students_df['Science_Score'], bins=20, color='lightgreen', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Distribution of Science Scores')
axes[0, 1].set_xlabel('Science Score')
axes[0, 1].set_ylabel('Frequency')

# Scatter plot: Hours studied vs Math score
axes[1, 0].scatter(students_df['Hours_Studied'], students_df['Math_Score'], alpha=0.6, color='red')
axes[1, 0].set_title('Hours Studied vs Math Score')
axes[1, 0].set_xlabel('Hours Studied')
axes[1, 0].set_ylabel('Math Score')

# Scatter plot: Math vs Science scores
axes[1, 1].scatter(students_df['Math_Score'], students_df['Science_Score'], alpha=0.6, color='purple')
axes[1, 1].set_title('Math Score vs Science Score')
axes[1, 1].set_xlabel('Math Score')
axes[1, 1].set_ylabel('Science Score')

plt.tight_layout()
plt.show()

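The scatter plots above give a visual impression of how the columns relate. A correlation matrix puts a number on it. The sketch below rebuilds the same kind of synthetic data under a different name (`scores_df`, chosen here so it does not overwrite `students_df`) and computes Pearson correlations. Because each column was generated independently, the off-diagonal values should come out close to zero.

```python
import numpy as np
import pandas as pd

# Rebuild the same kind of synthetic dataset so this block runs on its own.
np.random.seed(42)
n_students = 100
scores_df = pd.DataFrame({
    "Math_Score": np.clip(np.random.normal(75, 15, n_students), 0, 100),
    "Science_Score": np.clip(np.random.normal(80, 12, n_students), 0, 100),
    "Hours_Studied": np.clip(np.random.normal(5, 2, n_students), 0, 12),
})

# Pearson correlation for every pair of columns (values between -1 and 1).
correlations = scores_df.corr()
print(correlations.round(2))
```

With real student data you would expect to see stronger relationships, for example between study hours and scores.
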
## 9. Practice Exercises

Now it's your turn! Complete these exercises to practice what you've learned.

### Exercise 1: Array Operations
Create a NumPy array with numbers from 1 to 20, then:
1. Find all even numbers
2. Calculate the square of each number
3. Find the mean and standard deviation

In [None]:
# Your code here (one possible solution is shown below)
# Create array
arr = np.arange(1, 21)
print(f"Array: {arr}")

# Find even numbers
even_numbers = arr[arr % 2 == 0]
print(f"Even numbers: {even_numbers}")

# Calculate squares
squares = arr ** 2
print(f"Squares: {squares}")

# Mean and standard deviation
print(f"Mean: {np.mean(arr):.2f}")
print(f"Standard deviation: {np.std(arr):.2f}")

### Exercise 2: DataFrame Manipulation
Create a DataFrame with information about 5 movies (title, year, rating, genre), then:
1. Add a column for decade (e.g., 1990s, 2000s)
2. Filter movies with rating > 8.0
3. Group by genre and calculate average rating

In [None]:
# Your code here (one possible solution is shown below)
movies_data = {
    'Title': ['The Shawshank Redemption', 'The Godfather', 'Pulp Fiction', 'Inception', 'Parasite'],
    'Year': [1994, 1972, 1994, 2010, 2019],
    'Rating': [9.3, 9.2, 8.9, 8.8, 8.6],
    'Genre': ['Drama', 'Crime', 'Crime', 'Sci-Fi', 'Thriller']
}

movies_df = pd.DataFrame(movies_data)
print("Movies DataFrame:")
print(movies_df)

# Add decade column
movies_df['Decade'] = (movies_df['Year'] // 10) * 10
movies_df['Decade'] = movies_df['Decade'].astype(str) + 's'
print("\nWith decade column:")
print(movies_df)

# Filter high-rated movies
high_rated = movies_df[movies_df['Rating'] > 8.0]
print("\nHigh-rated movies (>8.0):")
print(high_rated)

# Group by genre
genre_avg = movies_df.groupby('Genre')['Rating'].mean()
print("\nAverage rating by genre:")
print(genre_avg)

### Exercise 3: Data Visualization
Using the movies DataFrame, create:
1. A bar plot of ratings by movie title
2. A histogram of movie years
3. A scatter plot of year vs rating

In [None]:
# Your code here (one possible solution is shown below)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Bar plot of ratings
axes[0].bar(range(len(movies_df)), movies_df['Rating'], color='gold', edgecolor='black')
axes[0].set_title('Movie Ratings')
axes[0].set_xlabel('Movie')
axes[0].set_ylabel('Rating')
axes[0].set_xticks(range(len(movies_df)))
axes[0].set_xticklabels(movies_df['Title'], rotation=45, ha='right')

# Histogram of years
axes[1].hist(movies_df['Year'], bins=5, color='lightcoral', edgecolor='black', alpha=0.7)
axes[1].set_title('Distribution of Movie Years')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Frequency')

# Scatter plot
axes[2].scatter(movies_df['Year'], movies_df['Rating'], color='navy', s=100, alpha=0.7)
axes[2].set_title('Year vs Rating')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Rating')

plt.tight_layout()
plt.show()

## 10. Summary and Next Steps

Congratulations! You've completed your first data science notebook. Here's what you learned:

✅ **Environment Setup**: Verified Python and essential libraries  
✅ **Jupyter Notebooks**: Learned to navigate and use notebooks effectively  
✅ **Python Fundamentals**: Reviewed key concepts for data science  
✅ **NumPy**: Created and manipulated arrays efficiently  
✅ **Pandas**: Worked with DataFrames for data analysis  
✅ **Matplotlib**: Created basic visualizations  
✅ **Data Analysis**: Performed your first complete analysis  

### Next Week Preview: Data Manipulation and EDA
- Advanced Pandas operations
- Data cleaning techniques
- Exploratory Data Analysis (EDA)
- Advanced visualizations with Seaborn

### Homework Assignment
1. Complete any unfinished exercises above
2. Create a DataFrame with data about yourself and 4 friends (name, age, favorite color, hobby)
3. Perform basic analysis and create 2 different visualizations
4. Write a short summary of what you discovered

### Additional Resources
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [NumPy Documentation](https://numpy.org/doc/)
- [Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)

Great job getting started with data science! 🎉