In [92]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The questions in this notebook were generated by ChatGPT.

I. Question: Data Cleaning and Transformation

You have a dataset containing daily website traffic information with columns: date, user_id, country, and page_views. Some rows have missing values in the country column, and the page_views column contains negative values due to data entry errors.

1.	Remove rows with missing country values.
2.	Replace negative page_views values with zero.
3.	Calculate the average number of page views per user per country.

Expected Output:
A new DataFrame with columns country, average_page_views.


In [93]:
# Sample data for website traffic
data = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
             '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-06'],
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'country': ['US', 'IN', None, 'BR', 'IN', 'US', 'BR', 'IN'],
    'page_views': [5, -3, 8, -1, 14, 2, 6, 10]
}

website_traffic = pd.DataFrame(data)
website_traffic

Unnamed: 0,date,user_id,country,page_views
0,2024-01-01,1,US,5
1,2024-01-02,2,IN,-3
2,2024-01-03,3,,8
3,2024-01-04,4,BR,-1
4,2024-01-04,5,IN,14
5,2024-01-05,6,US,2
6,2024-01-06,7,BR,6
7,2024-01-06,8,IN,10


In [94]:
# 1) Remove rows with missing country values
#    (.copy() gives an independent DataFrame, so the assignment in step 2
#    does not trigger a SettingWithCopyWarning)

aux = website_traffic[website_traffic['country'].notna()].copy()

# 2) Replace negative page_views values with zero

aux.loc[aux['page_views'] < 0, 'page_views'] = 0

# 3) Calculate the average number of page views per user per country
#    (each user_id appears once in the sample, so this is the mean of page_views per country)

aux.groupby('country').agg(average_page_views=('page_views', 'mean'))

Unnamed: 0_level_0,average_page_views
country,Unnamed: 1_level_1
BR,3.0
IN,8.0
US,3.5



II. Question: Time Series Analysis

You have a dataset sales with columns: date and sales_amount. You are asked to forecast the next 3 months’ sales based on the data provided.

1.	Convert the data to a time series format.
2.	Decompose the time series to identify trends and seasonality.
3.	Create a simple model (e.g., moving average) to forecast the next 3 months of sales.

Expected Output:
A table with the forecasted sales for the next 3 months.



In [95]:
# Sample data for sales

# Generate month-end dates for 36 months starting from January 2022
# ('ME' replaces the deprecated 'M' alias in pandas >= 2.2)

n = 36
dates = pd.date_range(start='2022-01-01', periods=n, freq='ME')

# Generate random sales amounts (no seed is set, so values differ on every run)
sales_amounts = np.random.normal(
    loc=250,   # mean
    scale=15,  # std
    size=n
).round()

# Create the DataFrame
sales = pd.DataFrame({
    'date': dates,
    'sales_amount': sales_amounts
})



In [None]:
# 1. Convert the data to a time series format.


sales['date'] = pd.to_datetime(sales['date'])

sales.set_index('date', inplace = True) 

# 2. Decompose the time series to identify trends and seasonality.
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose( 
    sales['sales_amount'],
    model = 'additive',
    period =12)

result.plot()
plt.show()

In [96]:
# 3. Create a simple model (e.g., moving average) to forecast the next 3 months of sales.

sales.reset_index(inplace=True)

# Recursive 3-month moving average: each forecast is the mean of the three
# most recent values, where earlier forecasts count as values
window = sales['sales_amount'].tail(3).to_list()

for i in range(3):
    window.append(np.mean(window[i:]))

forecast = window[-3:]

# Generate the 3 month ends that follow the last observed date
# ('ME' replaces the deprecated 'M' alias in pandas >= 2.2)
next_months = pd.date_range(
    start=sales['date'].max() + pd.offsets.MonthEnd(1),
    periods=3,
    freq='ME')

# Auxiliary DataFrame holding the forecast rows
aux = pd.DataFrame({
    'date': next_months,
    'sales_amount': forecast
})

result = pd.concat([sales, aux], ignore_index=True)
result.tail(10)


Unnamed: 0,date,sales_amount
29,2024-06-30,229.0
30,2024-07-31,260.0
31,2024-08-31,265.0
32,2024-09-30,263.0
33,2024-10-31,232.0
34,2024-11-30,274.0
35,2024-12-31,227.0
36,2025-01-31,244.333333
37,2025-02-28,248.444444
38,2025-03-31,239.925926



III. Question: Feature Engineering

You have a dataset customer_data with columns: customer_id, age, gender, annual_income, and purchase_amount. You need to prepare the data for a machine learning model.

1.	Create a new feature called income_per_age (annual income divided by age).
2.	Encode the gender column as a binary variable.
3.	Normalize the purchase_amount column using min-max scaling.

Expected Output:
A new DataFrame with the additional feature and transformations applied.


In [None]:
# Sample data for customer data
data = {
    'customer_id': [1, 2, 3, 4],
    'age': [25, 30, 45, 50],
    'gender': ['M', 'F', 'M', 'F'],
    'annual_income': [50000, 60000, 80000, 120000],
    'purchase_amount': [2000, 3000, 4000, 5000]
}

customer_data = pd.DataFrame(data)
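
One possible answer to the three feature engineering steps, as a minimal sketch. The names `features`, `gender_binary` and `purchase_amount_scaled`, and the mapping M -> 1 / F -> 0, are assumptions made for illustration; the question does not fix them.

```python
import pandas as pd

# Same sample data as in the cell above
customer_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'age': [25, 30, 45, 50],
    'gender': ['M', 'F', 'M', 'F'],
    'annual_income': [50000, 60000, 80000, 120000],
    'purchase_amount': [2000, 3000, 4000, 5000]
})

# Work on a copy so the raw data stays untouched
features = customer_data.copy()

# 1. New feature: annual income divided by age
features['income_per_age'] = features['annual_income'] / features['age']

# 2. Binary encoding of gender (assumed mapping: 'M' -> 1, 'F' -> 0)
features['gender_binary'] = features['gender'].map({'M': 1, 'F': 0})

# 3. Min-max scaling: (x - min) / (max - min), which maps values into [0, 1]
amount = features['purchase_amount']
features['purchase_amount_scaled'] = (amount - amount.min()) / (amount.max() - amount.min())

print(features)
```

Min-max scaling divides by `max - min`, so a constant column would need a guard against division by zero.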


IV. Question: Exploratory Data Analysis

Given a dataset employee_data with columns: employee_id, department, salary, and years_at_company, analyze the data to answer the following questions:

	1.	What is the average salary by department?
	2.	What is the correlation between salary and years_at_company?
	3.	Which department has the highest employee turnover?

Expected Output:
Descriptive statistics and a brief summary of findings.


In [None]:
# Sample data for employee data
data = {
    'employee_id': [1, 2, 3, 4, 5],
    'department': ['HR', 'Finance', 'IT', 'Finance', 'IT'],
    'salary': [60000, 75000, 50000, 80000, 90000],
    'years_at_company': [5, 7, 3, 10, 4]
}

employee_data = pd.DataFrame(data)
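
A possible starting point for the three EDA questions, as a sketch. Note that the sample data has no column recording departures, so turnover cannot be computed directly; average tenure per department is used below only as a rough, assumed proxy.

```python
import pandas as pd

# Same sample data as in the cell above
employee_data = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'department': ['HR', 'Finance', 'IT', 'Finance', 'IT'],
    'salary': [60000, 75000, 50000, 80000, 90000],
    'years_at_company': [5, 7, 3, 10, 4]
})

# 1. Average salary by department
avg_salary = employee_data.groupby('department')['salary'].mean()

# 2. Pearson correlation between salary and years at the company
corr = employee_data['salary'].corr(employee_data['years_at_company'])

# 3. Assumed proxy for turnover: shorter average tenure may hint at
#    higher turnover, but the data cannot confirm it
avg_tenure = employee_data.groupby('department')['years_at_company'].mean()

print(avg_salary)
print(f"correlation(salary, years_at_company) = {corr:.3f}")
print(avg_tenure.sort_values())
```

With only five rows, the correlation is a description of the sample rather than evidence of a relationship.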


#### V. Question: Hypothesis Testing

You are given a dataset experiment_data with columns: group (control or treatment), conversion_rate, and user_id. You need to test whether the treatment group has a significantly higher conversion rate than the control group.

1.	Formulate the null and alternative hypotheses.
2.	Perform a t-test to compare the conversion rates.
3.	Report the p-value and your conclusion.

Expected Output:
The p-value and a decision on whether to reject the null hypothesis.


In [None]:
# Sample data for experiment data
data = {
    'user_id': range(1, 101),
    'group': ['control'] * 50 + ['treatment'] * 50,
    'conversion_rate': [0.02, 0.03, 0.02, 0.01, 0.02] * 10 + 
    [0.04, 0.05, 0.04, 0.03, 0.04] * 10
}

experiment_data = pd.DataFrame(data)
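
A sketch of the test, assuming SciPy is available. Welch's variant of the t-test (`equal_var=False`) and the significance level `alpha = 0.05` are choices made here, not requirements of the question.

```python
import pandas as pd
from scipy import stats

# Same sample data as in the cell above
experiment_data = pd.DataFrame({
    'user_id': range(1, 101),
    'group': ['control'] * 50 + ['treatment'] * 50,
    'conversion_rate': [0.02, 0.03, 0.02, 0.01, 0.02] * 10
                       + [0.04, 0.05, 0.04, 0.03, 0.04] * 10
})

# 1. Hypotheses (one-sided)
#    H0: mean(treatment) <= mean(control)
#    H1: mean(treatment) >  mean(control)
is_control = experiment_data['group'] == 'control'
control = experiment_data.loc[is_control, 'conversion_rate']
treatment = experiment_data.loc[~is_control, 'conversion_rate']

# 2. Welch's t-test, one-sided in the direction "treatment is greater"
t_stat, p_value = stats.ttest_ind(
    treatment, control, equal_var=False, alternative='greater')

# 3. Decision at an assumed significance level
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```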


#### VI. Question: Data Aggregation and Summarization

You have two datasets: orders (with columns: order_id, product_id, customer_id, order_date, quantity) and products (with columns: product_id, product_name, price).

1.	Merge the datasets to include product details in the orders.
2.	Calculate the total revenue generated by each product.
3.	Find the top 3 products with the highest revenue.

Expected Output:
A DataFrame with the top 3 products, including product name and total revenue.


In [None]:
# Sample data for orders
orders_data = {
    'order_id': [1, 2, 3, 4, 5],
    'product_id': [101, 102, 103, 101, 104],
    'customer_id': [1, 1, 2, 2, 3],
    'order_date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'quantity': [2, 3, 5, 1, 4]
}

orders = pd.DataFrame(orders_data)

# Sample data for products
products_data = {
    'product_id': [101, 102, 103, 104],
    'product_name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'price': [10, 20, 15, 25]
}

products = pd.DataFrame(products_data)
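
One way to work through the merge, revenue and top 3 steps, as a sketch. Revenue is assumed to be `quantity * price`; the column name `total_revenue` is illustrative.

```python
import pandas as pd

# Same sample data as in the cell above
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'product_id': [101, 102, 103, 101, 104],
    'customer_id': [1, 1, 2, 2, 3],
    'order_date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'quantity': [2, 3, 5, 1, 4]
})
products = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'product_name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'price': [10, 20, 15, 25]
})

# 1. Add product details to every order (left join keeps all orders)
merged = orders.merge(products, on='product_id', how='left')

# 2. Revenue per order line, then summed per product
merged['revenue'] = merged['quantity'] * merged['price']
revenue = (
    merged.groupby(['product_id', 'product_name'], as_index=False)['revenue']
    .sum()
    .rename(columns={'revenue': 'total_revenue'})
)

# 3. Top 3 products by total revenue
top_3 = revenue.nlargest(3, 'total_revenue')[['product_name', 'total_revenue']]
print(top_3)
```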


VII. Question: Handling Categorical Data

You have a dataset survey_responses with columns: respondent_id, satisfaction (scale of 1 to 5), feedback, and age_group (categories: ‘Under 18’, ‘18-35’, ‘36-50’, ‘50+’).

	1.	Convert the age_group into dummy variables.
	2.	Calculate the average satisfaction score for each age group.
	3.	Create a word cloud of the feedback text.

Expected Output:

	1.	A transformed DataFrame with dummy variables.
	2.	A summary table of average satisfaction by age group.
	3.	A visualization (conceptual explanation in Google Docs).


In [None]:

# Sample data for survey responses
data = {
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': [3, 4, 5, 2, 4],
    'feedback': ['Good', 'Very good', 'Excellent', 'Poor', 'Good service'],
    'age_group': ['18-35', '36-50', '18-35', 'Under 18', '50+']
}

survey_responses = pd.DataFrame(data)
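
A sketch for the first two steps, plus the word frequencies a word cloud would be drawn from. Rendering the cloud itself needs a plotting library and is left out; the `age` column prefix and the lowercase, whitespace-based tokenization are assumptions.

```python
from collections import Counter

import pandas as pd

# Same sample data as in the cell above
survey_responses = pd.DataFrame({
    'respondent_id': [1, 2, 3, 4, 5],
    'satisfaction': [3, 4, 5, 2, 4],
    'feedback': ['Good', 'Very good', 'Excellent', 'Poor', 'Good service'],
    'age_group': ['18-35', '36-50', '18-35', 'Under 18', '50+']
})

# 1. One dummy column per age group, replacing the original column
dummies = pd.get_dummies(survey_responses['age_group'], prefix='age')
transformed = pd.concat(
    [survey_responses.drop(columns='age_group'), dummies], axis=1)

# 2. Average satisfaction score for each age group
avg_satisfaction = survey_responses.groupby('age_group')['satisfaction'].mean()

# 3. Word frequencies: the size of each word in a word cloud is
#    proportional to these counts
words = ' '.join(survey_responses['feedback']).lower().split()
word_counts = Counter(words)

print(transformed)
print(avg_satisfaction)
print(word_counts.most_common(3))
```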


VIII. Question: Statistical Modeling

You have a dataset student_scores with columns: student_id, study_hours, test_score. You are asked to create a simple linear regression model to predict test_score based on study_hours.

	1.	Split the data into training and testing sets.
	2.	Fit a linear regression model.
	3.	Evaluate the model using R-squared and RMSE.

Expected Output:
The model coefficients, R-squared, and RMSE values.


In [None]:
import pandas as pd

# Sample data for student scores
data = {
    'student_id': [1, 2, 3, 4, 5],
    'study_hours': [5, 6, 8, 3, 7],
    'test_score': [70, 75, 80, 65, 78]
}

student_scores = pd.DataFrame(data)
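
A minimal sketch of the three modeling steps. Two assumptions: the split is a fixed 3/2 split by row order so the result is reproducible (real data should be shuffled first), and `np.polyfit` is used for the least-squares line in place of a dedicated modeling library.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above
student_scores = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'study_hours': [5, 6, 8, 3, 7],
    'test_score': [70, 75, 80, 65, 78]
})

# 1. Train/test split (assumed: first 3 rows train, last 2 rows test)
train = student_scores.iloc[:3]
test = student_scores.iloc[3:]

# 2. Fit test_score = intercept + slope * study_hours by least squares
slope, intercept = np.polyfit(train['study_hours'], train['test_score'], deg=1)

# 3. Evaluate on the held-out rows
predicted = intercept + slope * test['study_hours']
residuals = test['test_score'] - predicted
ss_res = (residuals ** 2).sum()
ss_tot = ((test['test_score'] - test['test_score'].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
rmse = np.sqrt((residuals ** 2).mean())

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"R-squared = {r_squared:.3f}, RMSE = {rmse:.3f}")
```

With two test rows the metrics are very noisy; the point of the sketch is the mechanics, not the numbers.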


IX. Question: Clustering Analysis

You have a dataset customer_behavior with columns: customer_id, purchase_frequency, average_spent, and loyalty_score. You are asked to segment the customers into distinct groups.

	1.	Standardize the numerical features.
	2.	Apply K-means clustering with an appropriate number of clusters.
	3.	Describe the characteristics of each cluster.

Expected Output:
A summary table describing the characteristics of each cluster.


In [None]:

# Sample data for customer behavior
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'purchase_frequency': [5, 3, 4, 2, 6],
    'average_spent': [200, 150, 180, 100, 220],
    'loyalty_score': [8, 5, 7, 4, 9]
}

customer_behavior = pd.DataFrame(data)
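
A sketch of the segmentation using a small hand-written K-means (Lloyd's algorithm) in NumPy; a library implementation such as `sklearn.cluster.KMeans` would be the more usual choice. The number of clusters `k = 2` and the use of the first `k` rows as initial centroids are assumptions that keep the example deterministic.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above
customer_behavior = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'purchase_frequency': [5, 3, 4, 2, 6],
    'average_spent': [200, 150, 180, 100, 220],
    'loyalty_score': [8, 5, 7, 4, 9]
})
feature_cols = ['purchase_frequency', 'average_spent', 'loyalty_score']

# 1. Standardize: zero mean and unit variance per column
X = customer_behavior[feature_cols].to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. K-means: alternate between assigning points to the nearest centroid
#    and moving each centroid to the mean of its points
k = 2
centroids = X[:k].copy()
for _ in range(100):
    # distances has shape (n_points, k)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

# 3. Describe each cluster in the original units
clustered = customer_behavior.assign(cluster=labels)
summary = clustered.groupby('cluster')[feature_cols].mean()
summary['n_customers'] = clustered.groupby('cluster').size()
print(summary)
```

This sketch does not handle a cluster that ends up empty, which can happen with other data or other starting centroids.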


X. Question: Data Visualization and Interpretation

You are given a dataset sales_over_time with columns: date, region, and sales_amount. You are asked to visualize and interpret the sales trend.

	1.	Create a line plot showing sales over time, with separate lines for each region.
	2.	Identify any regions that show significant growth or decline.
	3.	Provide a brief analysis of the trends observed.

Expected Output:
A line plot and a written summary of the trends and insights.

These questions cover a range of data science tasks, including data cleaning, transformation, statistical analysis, modeling, and visualization. Make sure to practice explaining your thought process clearly as you solve them, as this is crucial in the interview setting.

In [None]:
# Sample data for sales over time
data = {
    'date': pd.date_range(start='2024-01-01', periods=12, freq='ME'),  # 'ME' replaces the deprecated 'M' alias (pandas >= 2.2)
    'region': ['North', 'South', 'East', 'West'] * 3,
    'sales_amount': [1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600]
}

sales_over_time = pd.DataFrame(data)
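
A sketch of the plot and of a simple growth measure per region. The dates are built with `pd.offsets.MonthEnd()`, which should give the same month ends as the cell above on any recent pandas version. Growth is assumed here to mean the percentage change from the first to the last observation of each region.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as in the cell above
sales_over_time = pd.DataFrame({
    'date': pd.date_range(start='2024-01-01', periods=12, freq=pd.offsets.MonthEnd()),
    'region': ['North', 'South', 'East', 'West'] * 3,
    'sales_amount': [1500, 1600, 1700, 1800, 1900, 2000,
                     2100, 2200, 2300, 2400, 2500, 2600]
})

# 1. One line per region. Each region has only every fourth month, so the
#    groups are plotted separately instead of pivoting into a sparse table.
fig, ax = plt.subplots(figsize=(8, 4))
for region, group in sales_over_time.groupby('region'):
    ax.plot(group['date'], group['sales_amount'], marker='o', label=region)
ax.set_xlabel('date')
ax.set_ylabel('sales_amount')
ax.legend(title='region')
# In a notebook the figure renders after the cell; call plt.show() in a script.

# 2. Growth per region from first to last observation
growth = (
    sales_over_time.sort_values('date')
    .groupby('region')['sales_amount']
    .agg(first='first', last='last')
)
growth['pct_change'] = (growth['last'] / growth['first'] - 1) * 100
print(growth.sort_values('pct_change', ascending=False))
```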

# What else to study? https://www.youtube.com/watch?v=hAqg2dlNeUc&list=LL

### Data structures
    * Arrays
    * Hash maps/Dictionaries
    * Heaps
    * Sets
    * Stacks/Queues
    * Strings
    * Trees

### Algorithms
    * Binary Search
    * Recursion
    * Sorting
    * Dynamic Programming
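
As an example of the first topic in this list, a minimal iterative binary search. The function name and the convention of returning -1 for a missing value are illustrative choices.

```python
def binary_search(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent.

    Assumes sorted_values is sorted in ascending order.
    """
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return mid
        if sorted_values[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1


print(binary_search([1, 3, 5, 7, 9], 7))
print(binary_search([1, 3, 5, 7, 9], 4))
```

Each iteration halves the search range, so the running time is O(log n).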

### Mathematics and Statistics
1. Simulation
    * Monte Carlo simulation
    * Simulating one distribution from another distribution
    * Sampling techniques
        - Importance, Rejection, Inverse transform, Weighted sampling

2. Other
    * Divisibility of natural numbers
    * Euclidean algorithm: greatest common divisor
3. Machine Learning
    * Coding ML from scratch
        - Decision Tree
        - Linear and Logistic Regression
        - K-nearest Neighbors
        - K-means clustering
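
Three of the topics above fit in a few lines each. The sketch below uses only the standard library; the function names and the choice of the exponential distribution for inverse transform sampling are illustrative.

```python
import math
import random


def gcd(a, b):
    """Euclidean algorithm: gcd(a, b) = gcd(b, a mod b) until b is 0."""
    while b:
        a, b = b, a % b
    return a


def sample_exponential(rate, rng):
    """Inverse transform sampling: if U ~ Uniform(0, 1), then
    -ln(1 - U) / rate follows an Exponential(rate) distribution."""
    return -math.log(1.0 - rng.random()) / rate


def estimate_pi(n_points, rng):
    """Monte Carlo simulation: the share of uniform points in the unit
    square that land inside the quarter circle approximates pi / 4."""
    inside = sum(
        1 for _ in range(n_points)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / n_points


rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(gcd(48, 18))
print(sum(draws) / len(draws))    # should be close to 1 / rate = 0.5
print(estimate_pi(100_000, rng))  # should be close to math.pi
```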




### Examples
1) Find the median of an unsorted array
2) Simulating a multinomial distribution using uniform random numbers
3) Enumerating all prime numbers up to a given natural number N
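
Possible solutions to the three examples, as sketches using only the standard library. The choice of quickselect for the median and of the sieve of Eratosthenes for the primes are assumptions about what the exercises are aiming at; all function names are illustrative.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    pivot = random.choice(values)
    lows = [v for v in values if v < pivot]
    highs = [v for v in values if v > pivot]
    n_pivots = len(values) - len(lows) - len(highs)
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + n_pivots:
        return pivot
    return quickselect(highs, k - len(lows) - n_pivots)


def median(values):
    """1) Median of an unsorted, non-empty list without a full sort."""
    n = len(values)
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2


def sample_multinomial(n_trials, probs, rng=random):
    """2) Multinomial counts from uniform random numbers: each uniform
    draw is mapped to a category through the cumulative probabilities."""
    cumulative = list(accumulate(probs))
    counts = [0] * len(probs)
    for _ in range(n_trials):
        u = rng.random() * cumulative[-1]
        # min() guards against a rounding edge case at the upper end
        counts[min(bisect_right(cumulative, u), len(probs) - 1)] += 1
    return counts


def primes_up_to(n):
    """3) Sieve of Eratosthenes: all primes less than or equal to n."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


print(median([7, 1, 5, 3, 9]))
print(sample_multinomial(1000, [0.2, 0.3, 0.5], random.Random(0)))
print(primes_up_to(30))
```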
 