# Sample Exam: Python Associate Exam - VoltBike Innovations

VoltBike Innovations is a leading company in the electric bicycle (e-bike) industry, specializing in the design and manufacture of high-performance e-bikes. The company is dedicated to advancing urban mobility solutions by delivering state-of-the-art e-bikes with features such as varying motor powers, advanced battery capacities, and efficient charging systems.

Recently, VoltBike Innovations has encountered some challenges in managing production costs while ensuring high levels of customer satisfaction. These issues have led to increased production expenses and variability in costs, impacting overall profitability.

You are part of the data analysis team tasked with providing actionable insights to help VoltBike Innovations address these challenges.

# Task 1

Before you can start any analysis, you need to confirm that the data is accurate and reflects what you expect to see. 

It is known that there are some issues with the `production_data` table, and the data team have provided the following data description. 

Create a cleaned version of the dataframe that matches this description. You must match all column names and description criteria.

- You should start with the data in the file `ebike_data.csv`.
- Your output should be a dataframe named `clean_data`.
- All column names and values should match the table below.

| Column Name         | Criteria                                                                                         |
|----------------------|--------------------------------------------------------------------------------------------------|
| bike_type            | Categorical. Type of e-bike. ['standard', 'folding', 'mountain', 'road']. <br> Missing values should be replaced with 'standard'. |
| frame_material       | Categorical. Material of the e-bike frame. ['aluminum', 'steel', 'carbon fiber']. <br> Missing values should be replaced with 'unknown'. |
| production_cost      | Continuous. Cost of production (in USD). <br> Missing values should be replaced with median. |
| assembly_time        | Continuous. Time taken for assembly (in minutes). <br> Missing values should be replaced with mean. |
| top_speed            | Continuous. Maximum speed of the e-bike (in km/h). <br> Missing values should be replaced with mean. |
| battery_type         | Categorical. Type of battery used. ['li-ion', 'nimh', 'lead acid']. <br> Missing values should be replaced with 'other'. |
| motor_power          | Continuous. Power output of the motor (in watts). <br> Missing values should be replaced with median. |
| customer_score       | Continuous. Customer satisfaction score (rating on a scale of 1 to 10). <br> Missing values should be replaced with mean. |



In [30]:
# Task 1
import pandas as pd
import numpy as np
df = pd.read_csv("ebike_data.csv")
clean_data = df.copy()

# bike_type
clean_data['bike_type'] = clean_data['bike_type'].astype('category')

# frame_material
clean_data['frame_material'] = clean_data['frame_material'].str.lower()
clean_data['frame_material'] = clean_data['frame_material'].astype('category')

# production_cost: no action necessary
# assembly_time: no action necessary

# top_speed
top_speed_mean = clean_data['top_speed'].mean()
# fill missing speeds with the column mean, then round to 2 decimals
clean_data['top_speed'] = clean_data['top_speed'].fillna(top_speed_mean).round(2)

# battery_type
clean_data['battery_type'] = clean_data['battery_type'].replace({'-': 'other'})
clean_data['battery_type'] = clean_data['battery_type'].astype('category')

# motor_power
clean_data['motor_power'] = clean_data['motor_power'].str.replace('W', '')
clean_data['motor_power'] = clean_data['motor_power'].astype(int)

# customer_score
# cap any score above 10 at the scale maximum of 10
clean_data.loc[clean_data['customer_score'] > 10, 'customer_score'] = 10
clean_data['customer_score'] = clean_data['customer_score'].round().astype(int)
#print(clean_data.dtypes)
#print(clean_data.describe())
#print(clean_data.info())
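
As a cross-check on the cell above, the fill rules in the criteria table can also be applied column by column. The following is a minimal sketch on a made-up three-row frame, not the exam's reference solution: the `toy` rows, the `apply_fill_rules` helper, and the assumptions that `-` marks a missing category and that `motor_power` arrives as text such as `250W` are all hypothetical and should be checked against the real `ebike_data.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for ebike_data.csv (values are made up).
toy = pd.DataFrame({
    "bike_type": ["road", None, "folding"],
    "frame_material": ["Steel", None, "aluminum"],
    "production_cost": [400.0, None, 600.0],
    "assembly_time": [50.0, 70.0, None],
    "top_speed": [25.0, None, 30.0],
    "battery_type": ["li-ion", "-", None],
    "motor_power": ["250W", "500W", None],
    "customer_score": [7.0, None, 9.0],
})

# Fill rules copied from the criteria table.
CATEGORICAL_FILL = {
    "bike_type": "standard",
    "frame_material": "unknown",
    "battery_type": "other",
}
MEDIAN_FILL = ["production_cost", "motor_power"]
MEAN_FILL = ["assembly_time", "top_speed", "customer_score"]


def apply_fill_rules(df):
    """Return a copy of df with missing values filled per the criteria table."""
    out = df.copy()

    # Assumption: motor_power may be text like "250W"; anything that cannot
    # be parsed as a number becomes NaN and is filled below.
    power_text = out["motor_power"].astype(str).str.replace("W", "", regex=False)
    out["motor_power"] = pd.to_numeric(power_text, errors="coerce")

    for col, fill_value in CATEGORICAL_FILL.items():
        lowered = out[col].str.lower()
        # Assumption: a bare "-" is a placeholder for a missing category.
        lowered = lowered.mask(lowered == "-")
        out[col] = lowered.fillna(fill_value)

    for col in MEDIAN_FILL:
        out[col] = out[col].fillna(out[col].median())
    for col in MEAN_FILL:
        out[col] = out[col].fillna(out[col].mean())
    return out


clean_toy = apply_fill_rules(toy)
print(clean_toy)
```

Calling `fillna` on a column with no missing values leaves it unchanged, so applying every rule is safe even when inspection suggests a column needs no action.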

# Task 2

You want to understand how different types of e-bikes influence production costs, assembly times, and customer satisfaction.

Calculate the average `production_cost`, `assembly_time`, and `customer_score` grouped by `bike_type`.

- You should start with the data in the file `ebike_data.csv`.
- Your output should be a data frame named `bike_type_data`.
- It should include the four columns: `bike_type`, `avg_production_cost`, `avg_assembly_time`, and `avg_customer_score`.
- Your answers should be rounded to 2 decimal places.

In [31]:
# Task 2
df_task2 = pd.read_csv("ebike_data.csv")
#calculate means
bike_type_data = df_task2.groupby('bike_type')[['production_cost', 'assembly_time', 'customer_score']].mean().reset_index()
# change column names and round
dict_names = {'production_cost': 'avg_production_cost', 
              'assembly_time': 'avg_assembly_time', 
              'customer_score': 'avg_customer_score'}
bike_type_data = bike_type_data.rename(dict_names, axis='columns').round(2)
print(bike_type_data)

  bike_type  avg_production_cost  avg_assembly_time  avg_customer_score
0   folding               499.72              61.40                6.46
1  mountain               507.02              59.79                6.52
2      road               503.02              61.19                6.56
3  standard               489.85              59.81                6.50
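
The same grouped averages can be produced in one step with pandas named aggregation, which sets the output column names inside `agg` and so avoids the separate rename. This is a sketch on a made-up frame; the `toy` rows and the `bike_type_summary` name are hypothetical and only stand in for the real file and for `bike_type_data`.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself reads ebike_data.csv.
toy = pd.DataFrame({
    "bike_type": ["road", "road", "folding"],
    "production_cost": [500.0, 510.0, 480.0],
    "assembly_time": [60.0, 62.0, 58.0],
    "customer_score": [6.0, 7.0, 8.0],
})

# Named aggregation: each keyword is an output column, given as
# (source column, aggregation function).
bike_type_summary = (
    toy.groupby("bike_type")
    .agg(
        avg_production_cost=("production_cost", "mean"),
        avg_assembly_time=("assembly_time", "mean"),
        avg_customer_score=("customer_score", "mean"),
    )
    .reset_index()
    .round(2)
)
print(bike_type_summary)
```

Both approaches should give the same numbers; the choice is mainly about whether the renaming reads more clearly as a dictionary or inline.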


# Task 3

In order to proceed with further analysis, you need to understand how key production and satisfaction factors relate to each other. Start by calculating the mean and standard deviation for the following columns: `production_cost` and `customer_score`. These statistics will help in understanding the central tendency and variability of the data related to e-bike production and customer feedback.

Next, calculate the Pearson correlation coefficient between `production_cost` and `customer_score`. This correlation coefficient will provide insights into the strength and direction of the relationship between production costs and customer satisfaction.

- You should start with the data in the file `ebike_data.csv`.
- Calculate the mean and standard deviation for the columns `production_cost` and `customer_score` as: `production_cost_mean`, `production_cost_sd`, `customer_score_mean`, and `customer_score_sd`.
- Calculate the Pearson correlation coefficient between `production_cost` and `customer_score` as `corr_coef`.
- Your output should be a data frame named `bike_analysis`.
- It should include the columns: `production_cost_mean`, `production_cost_sd`, `customer_score_mean`, `customer_score_sd`, and `corr_coef`.
- Ensure that your answers are rounded to 2 decimal places.


In [32]:
# Task 3
df_task3 = pd.read_csv("ebike_data.csv")
# calculate mean and standard deviation
production_cost_mean = df_task3['production_cost'].mean()
production_cost_sd = df_task3['production_cost'].std()
customer_score_mean = df_task3['customer_score'].mean()
customer_score_sd = df_task3['customer_score'].std()
#calculate correlation coefficient
corr_coef = df_task3[['production_cost', 'customer_score']].corr().iloc[0,1]
# create dataframe
data_dict = {'production_cost_mean':production_cost_mean, 
             'production_cost_sd':production_cost_sd, 
             'customer_score_mean':customer_score_mean, 
             'customer_score_sd':customer_score_sd, 
             'corr_coef': corr_coef}
bike_analysis = pd.DataFrame(data_dict, index=[0]).round(2)
print(bike_analysis)

   production_cost_mean  production_cost_sd  ...  customer_score_sd  corr_coef
0                 500.0              173.34  ...               1.63       0.48

[1 rows x 5 columns]
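
To make the statistics above concrete, here is a small standard-library sketch of the two formulas involved: the sample standard deviation (dividing by n - 1, which is the default for pandas `.std()`) and the Pearson correlation coefficient. The helper names `sample_sd` and `pearson_r` and the input lists are made up for illustration and are not taken from `ebike_data.csv`.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical values, chosen so the arithmetic is easy to follow.
costs = [400.0, 500.0, 600.0]
scores = [5.0, 7.0, 6.0]

cost_sd = sample_sd(costs)
r = pearson_r(costs, scores)
print(round(cost_sd, 2), round(r, 2))
```

One difference to keep in mind: pandas `.corr()` skips rows where either value is missing, while this sketch assumes complete lists of equal length. Separately, printing with `bike_analysis.to_string()` shows every column instead of the `...` placeholder in the output above.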
