<a href="https://colab.research.google.com/github/chetnaarora93/Data-Analysis-Project/blob/main/YULU_casestudy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Yulu is India’s leading micro-mobility service provider, which offers unique vehicles for the daily commute. Starting off as a mission to eliminate traffic congestion in India, Yulu provides the safest commute solution through a user-friendly mobile app to enable shared, solo and sustainable commuting.

# **IMPORTING LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic
from scipy import stats
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency
from scipy.stats import f_oneway
from scipy.stats import probplot
from scipy.stats import levene
from scipy.stats import kruskal

# **LOADING THE DATASET**

In [None]:
df=pd.read_csv("https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/428/original/bike_sharing.csv?1642089089")
df

**ANALYSING BASIC METRICS**

In [None]:
df.shape

The data consists of 10886 rows and 12 columns.

In [None]:
df.info()

The 'datetime' column is of object type; the remaining columns are numerical.

There are no nulls in the data.

In [None]:
df.describe()

# **CHECKING NULL VALUES**

In [None]:
df.isna().sum()

There are no null values.

**UNIVARIATE ANALYSIS**

In [None]:
df["holiday"].value_counts()

In [None]:
df["workingday"].value_counts()

In [None]:
df["season"].value_counts()

Each row is one record (the rows are hourly observations, not days). The number of records in each category is:

1.Holiday

10575 records fall on non-holidays.

311 records fall on holidays.

2.Season

2686 records are in spring

2733 records are in summer

2733 records are in fall

2734 records are in winter

3.Working day

Number of records on non-working days is 3474

Number of records on working days is 7412
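
The raw counts are easier to compare as shares of all records. The sketch below re-enters the counts reported above by hand so that it runs on its own; the helper name `shares` is illustrative and not part of the notebook. On the loaded data, `df["season"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied by hand from the value_counts() results summarised above.
holiday_counts = pd.Series({"non-holiday": 10575, "holiday": 311})
season_counts = pd.Series({"spring": 2686, "summer": 2733, "fall": 2733, "winter": 2734})
workingday_counts = pd.Series({"non-working": 3474, "working": 7412})


def shares(counts: pd.Series) -> pd.Series:
    """Return each category's share of the total, in percent (hypothetical helper)."""
    return (counts / counts.sum() * 100).round(1)


for name, counts in [("holiday", holiday_counts),
                     ("season", season_counts),
                     ("workingday", workingday_counts)]:
    print(name)
    print(shares(counts).to_string())
```

The seasons are almost perfectly balanced, while holidays make up only a small fraction of the records, which is worth keeping in mind when comparing groups later.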

**Frequency Distribution of Columns using Histogram**

In [None]:
# plotting histogram to analyse the distribution of numerical columns
df['datetime'] = pd.to_datetime(df['datetime'])
fig, axis = plt.subplots(nrows=2, ncols=4, figsize=(15,8))
sns.histplot(df['registered'], ax=axis[0,0], kde=True)
sns.histplot(df['count'], ax=axis[0,1], kde=True)
sns.histplot(df['temp'], ax=axis[0,2], kde=True)
sns.histplot(df['atemp'], ax=axis[0,3], kde=True)
sns.histplot(df['humidity'], ax=axis[1,0], kde=True)
sns.histplot(df['windspeed'], ax=axis[1,1], kde=True)
sns.histplot(df['casual'], ax=axis[1,2], kde=True)

sns.histplot(df['datetime'], ax=axis[1,3], kde=True)
plt.xticks(rotation=45)

plt.show()

The temp, atemp and humidity columns are roughly bell-shaped (approximately normal).

The casual, registered and count columns are right-skewed and resemble a log-normal distribution.

The windspeed column does not look normal; its shape loosely resembles a binomial distribution.

The datetime column has a roughly flat (uniform) distribution.
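
Reading distribution shapes off a histogram is subjective, so a numeric skewness check can back up the visual impression. The sketch below uses synthetic stand-in samples rather than the Yulu data; the helper `shape_label` and its 0.5 threshold are assumptions chosen for illustration, not an established cutoff.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): one symmetric sample, one right-skewed sample.
symmetric = rng.normal(loc=20, scale=8, size=5000)
right_skewed = rng.lognormal(mean=4, sigma=1, size=5000)


def shape_label(values, threshold=0.5):
    """Classify a sample by its skewness (threshold is a rough rule of thumb)."""
    s = skew(values)
    if s > threshold:
        return "right-skewed"
    if s < -threshold:
        return "left-skewed"
    return "roughly symmetric"


print("symmetric sample:", shape_label(symmetric))
print("log-normal sample:", shape_label(right_skewed))
```

On the notebook's data the same helper could be applied column by column, for example `shape_label(df['count'])`.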



# **Outlier detection using Boxplot**

In [None]:
# plotting box plots to detect outliers in the data

columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual',
'registered','count']
fig, axis = plt.subplots(nrows=2, ncols=3, figsize=(12,6))
index = 0
for row in range(2):
  for col in range(3):
    sns.boxplot(x=df[columns[index]], ax=axis[row, col])
    index += 1
plt.show()

plt.figure(figsize=(3.5,3))
sns.boxplot(x=df[columns[-1]])
plt.show()

The count, registered, casual and windspeed columns have a large number of outliers. Humidity has comparatively few.

The data in temp, atemp and humidity is symmetric, but in windspeed, casual, registered and count, it is skewed.
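
The points drawn beyond the boxplot whiskers follow the 1.5 × IQR rule by default, so the number of outliers per column can also be counted directly. A minimal sketch on a toy series; the function name `iqr_outlier_count` is hypothetical.

```python
import pandas as pd


def iqr_outlier_count(values: pd.Series, whis: float = 1.5) -> int:
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy data: one value far above the rest.
sample = pd.Series([1, 2, 3, 4, 5, 100])
print(iqr_outlier_count(sample))
```

Applied to the notebook's data, something like `df[columns].apply(iqr_outlier_count)` would give one count per numerical column.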

The mean temperature is 20 degrees, mean humidity is around 61 and the mean windspeed is around 12 kmph.

The mean count of cycles rented is 191.

The mean number of registered users is 155, and mean number of casual users is 36.

# **BI-VARIATE ANALYSIS**

# **Relation between Categorical Columns using Mosaic Plot**

In [None]:
# Mosaic Plot to show the relations between various categorical columns in the data

fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(12, 6))

mosaic(df, ['weather', 'holiday'], ax=axis[0,0])
axis[0,0].set_title('Weather Holiday Relation')
axis[0,0].set_xlabel('Weather')
axis[0,0].set_ylabel('Holiday')

mosaic(df, ['season', 'holiday'], ax=axis[1,0])
axis[1,0].set_title('Season Holiday Relation')
axis[1,0].set_xlabel('Season')
axis[1,0].set_ylabel('Holiday')

mosaic(df, ['weather', 'workingday'], ax=axis[0,1])
axis[0,1].set_title('Weather Workingday Relation')
axis[0,1].set_xlabel('Weather')
axis[0,1].set_ylabel('Workingday')

mosaic(df, ['season', 'workingday'], ax=axis[1,1])
axis[1,1].set_title('Season Workingday Relation')
axis[1,1].set_xlabel('Season')
axis[1,1].set_ylabel('Workingday')

plt.tight_layout()
plt.show()

Most holiday records fall under Clear, Few clouds or Partly cloudy weather, and the fewest under Heavy Rain with Ice Pellets, Thunderstorm and Mist, or Snow and Fog.

The same holds for working days, but the split between working and non-working days is more pronounced.

The share of holidays is roughly the same in every season.

The same holds for working days: in every season, non-working and working records are in roughly a 1:2 ratio (3474 to 7412 overall).
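
The proportions that a mosaic plot encodes as tile areas can also be read as numbers from a row-normalised contingency table. A small sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy data standing in for the season and workingday columns.
toy = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "workingday": [0, 1, 1, 1],
})

# normalize="index" makes every row sum to 1, i.e. the share of
# non-working (0) and working (1) records within each season.
row_shares = pd.crosstab(toy["season"], toy["workingday"], normalize="index")
print(row_shares)
```

With the notebook's data, `pd.crosstab(df['season'], df['workingday'], normalize='index')` would show whether the working-day share really is similar across seasons.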

# **Dependencies between numerical attributes and count using Scatter Plot**

In [None]:
# Scatter plot for relation between Count and other factors

columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual',
'registered','count']
fig, axis = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))
index = 0
for row in range(2):
  for col in range(3):
    sns.scatterplot(data=df, x=columns[index], y='count', hue= 'count', ax=axis[row, col])
    index += 1
plt.show()

Most vehicles are rented when the temperature is between 15 and 30.

Most vehicles are rented when humidity is between 20 and 80. Below 20, very few are rented.

Rentals peak when the windspeed is around 15-25; above 35, few vehicles are rented.

Casual user counts are densest between 50 and 150, with some around 300.

The maximum number of registered users exceeds 800.
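
Scatter plots show the shape of a relationship; a correlation coefficient summarises its direction and strength in one number. The sketch below runs on a toy frame with made-up values, and `correlation_with_count` is a hypothetical helper built on pandas' `DataFrame.corrwith`.

```python
import pandas as pd

# Toy values (assumptions): count rises with temp and falls with humidity.
toy = pd.DataFrame({
    "temp": [10, 15, 20, 25, 30],
    "humidity": [90, 80, 60, 50, 30],
    "count": [20, 60, 110, 150, 210],
})


def correlation_with_count(frame: pd.DataFrame, target: str = "count") -> pd.Series:
    """Pearson correlation of every other column with the target column."""
    features = frame.drop(columns=[target])
    return features.corrwith(frame[target]).sort_values()


print(correlation_with_count(toy))
```

On the real data this could be called on `df[['temp', 'atemp', 'humidity', 'windspeed', 'count']]`. Pearson correlation only captures linear association, so a peak at moderate values (as described for windspeed) can still yield a coefficient near zero.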

# **HYPOTHESIS TESTING**

1. Two-sample t-test to check if Working Day has an effect on the number of electric cycles rented

H0: Working day has no effect on the number of cycles rented.

Ha: Working day has an effect on the number of cycles rented.

Significance level(alpha): 0.05

Before conducting the two-sample t-test, we have to check whether the two groups have similar variance. As a rule of thumb, if the ratio of the larger variance to the smaller variance is less than 4:1, we can treat the variances as equal; otherwise not.

In [None]:
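# checking for equality of variance

# workingday == 1 marks a working day, workingday == 0 a non-working day
w_day = df[df['workingday']==1]['count'].values
non_wday = df[df['workingday']==0]['count'].values

print('Variance for working days:',np.var(w_day))
print('Variance for non working days:',np.var(non_wday))

# ratio of the larger variance to the smaller one (true division, not floor division)
variances = [np.var(w_day), np.var(non_wday)]
print('Ratio is:', max(variances) / min(variances))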
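
The variance-ratio rule of thumb can be complemented by Levene's test, which the notebook imports but does not use. A hedged sketch on synthetic groups (not the Yulu data); `variance_ratio` is a hypothetical helper, and it uses the sample variance (`ddof=1`), which differs only marginally from `np.var`'s default for large groups.

```python
import numpy as np
from scipy.stats import levene


def variance_ratio(a, b):
    """Ratio of the larger sample variance to the smaller one."""
    var_a, var_b = np.var(a, ddof=1), np.var(b, ddof=1)
    return max(var_a, var_b) / min(var_a, var_b)


rng = np.random.default_rng(0)
# Synthetic groups (assumptions): same mean, clearly different spread.
narrow = rng.normal(loc=100, scale=1, size=500)
wide = rng.normal(loc=100, scale=3, size=500)

ratio = variance_ratio(narrow, wide)
stat, p_value = levene(narrow, wide)  # H0: the groups have equal variances

print(f"Variance ratio: {ratio:.2f}")
print(f"Levene statistic: {stat:.2f}, p-value: {p_value:.4g}")
```

If Levene's test rejects equal variances, passing `equal_var=False` to `ttest_ind` runs Welch's t-test instead of the pooled-variance version.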


In [None]:
#performing t test

stats.ttest_ind(a=w_day, b=non_wday, equal_var=True)

Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is not sufficient evidence that working day has an effect on the number of cycles rented.

2.Chi-square test

I. to check if Weather is dependent on the season

H0: Weather is not dependent on Season

Ha: Weather is dependent on Season

Significance Level(alpha): 0.05

In [None]:
# checking the above hypothesis at a significance level of 0.05

# creating a contingency table
c_table = pd.crosstab(df['season'], df['weather'])

# chi-square test
chi2, p_value ,_,_ = chi2_contingency(c_table)

print(f'Chi-square statistic: {chi2}')
print(f'P-value: {p_value}')

# setting the significance level and interpreting the result
alpha = 0.05
if p_value < alpha:
    print("Weather is dependent on the season.")
else:
    print("Weather is not dependent on the season.")

The test shows that the Weather is dependent on season.
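
`chi2_contingency` also returns the degrees of freedom and the expected frequencies, which the cell above discards. A common rule of thumb is that the chi-square approximation is reliable when expected counts are at least 5, so rare categories deserve a look. The sketch below uses a made-up 2×2 table; the labels are placeholders.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up contingency table (placeholder labels, not the Yulu counts).
toy_table = pd.DataFrame(
    [[10, 20], [20, 10]],
    index=["season_a", "season_b"],
    columns=["weather_1", "weather_2"],
)

chi2, p_value, dof, expected = chi2_contingency(toy_table)
expected = pd.DataFrame(expected, index=toy_table.index, columns=toy_table.columns)

# Number of cells whose expected count falls below the rule-of-thumb minimum of 5.
small_cells = int((expected < 5).to_numpy().sum())

print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
print(expected)
print("Cells with expected count < 5:", small_cells)
```

If the season-by-weather table has cells with small expected counts, merging the rarest weather category with its neighbour before testing is one option to consider.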

II. Chi-square test to check if Working day is dependent on the weather

H0: Working day is not dependent on Weather

Ha: Working day is dependent on Weather

Significance Level(alpha): 0.05

In [None]:
# checking the hypothesis at a significance level of 0.05

# creating a contingency table
c_table = pd.crosstab(df['workingday'], df['weather'])

# chi-square test
chi2, p_value ,_,_ = chi2_contingency(c_table)

print(f'Chi-square statistic: {chi2}')
print(f'P-value: {p_value}')

# setting the significance level and interpreting the result
alpha = 0.05
if p_value < alpha:
    print("Working day is dependent on the weather.")
else:
    print("Working day is not dependent on the weather.")

The chi-square test shows that working day is dependent on weather.

# **3.ANOVA to check if No. of cycles rented is similar or different in different 1. weather 2. season**

ASSUMPTIONS OF ANOVA

I.Check For Normality using QQ Plots

In [None]:
# Q-Q Plot for each group of Weather

# Grouping Weather
w_group = [df['count'][df['weather'] == i] for i in df['weather'].unique()]

# Creating Q-Q plots
plt.figure(figsize=(9,6))
for i, group in enumerate(w_group):
    plt.subplot(2, 2, i + 1)
    probplot(group, plot=plt)
    plt.title(f'Q-Q Plot - Weather Group {i + 1}')

plt.tight_layout()
plt.show()
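
Q-Q plots are judged by eye; the Shapiro-Wilk test gives a numeric counterpart for each group. This is a sketch on synthetic data, not the notebook's result. `shapiro_by_group` is a hypothetical helper that skips groups with fewer than 3 rows, since `scipy.stats.shapiro` needs at least 3 observations.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic stand-in: one large right-skewed group and one group with only 2 rows.
toy = pd.DataFrame({
    "weather": [1] * 200 + [4] * 2,
    "count": np.concatenate([rng.exponential(scale=100, size=200), [150.0, 170.0]]),
})


def shapiro_by_group(frame, group_col, value_col, min_size=3):
    """Shapiro-Wilk p-value per group; groups smaller than min_size are skipped."""
    p_values = {}
    for key, values in frame.groupby(group_col)[value_col]:
        if len(values) < min_size:
            continue  # too few observations to test
        _, p = shapiro(values)
        p_values[key] = p
    return pd.Series(p_values, name="shapiro_p")


print(shapiro_by_group(toy, "weather", "count"))
```

A p-value below 0.05 rejects normality for that group. SciPy warns that the p-value may be inaccurate for samples larger than 5000, so for very large groups the Q-Q plot remains the more useful check.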


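4.Kruskal-Wallis Test to check if No. of cycles rented is similar or different in different 1. weather 2. season

The count column is right-skewed, so the normality assumption of ANOVA is doubtful. The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA that does not assume normally distributed groups, so it is used here instead.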

For weather

H0: Number of cycles rented is not different in different weather

Ha: Number of cycles rented is different in different weather

For Season

H0: Number of cycles rented is not different in different seasons

Ha: Number of cycles rented is different in different seasons

Significance level(alpha): 0.05

In [None]:
#  performing Kruskal-Wallis Test to test the above hypothesis

# Creating groups for Weather
w_groups = [df['count'][df['weather'] == i] for i in df['weather'].unique()]

# Perform Kruskal-Wallis test
stat_weather, p_value_weather = kruskal(*w_groups)

print(f'Kruskal-Wallis Test for Weather:')
print(f'Statistic: {stat_weather}')
print(f'P-value: {p_value_weather}')

if (p_value_weather<0.05):
  print("Number of cycles rented is different in different weathers.")
else:
  print("Number of cycles rented is not different in different weathers.")

# Creating groups for Season
s_groups = [df['count'][df['season'] == i] for i in df['season'].unique()]

# Perform Kruskal-Wallis
stat_season, p_value_season = kruskal(*s_groups)

print('\nKruskal-Wallis Test for Season:')
print(f'Statistic: {stat_season}')
print(f'P-value: {p_value_season}')

if (p_value_season<0.05):
  print("Number of cycles rented is different in different seasons.")
else:
  print("Number of cycles rented is not different in different seasons.")


The above test shows that the number of cycles rented is different in different weather and seasons.
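
A significant Kruskal-Wallis result only says that at least one group differs; it does not say which group or in which direction. A per-group summary fills that gap. A minimal sketch on toy data with made-up values:

```python
import pandas as pd
from scipy.stats import kruskal

# Toy data: three groups with clearly separated values.
toy = pd.DataFrame({
    "season": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "count": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

groups = [g.to_numpy() for _, g in toy.groupby("season")["count"]]
stat, p_value = kruskal(*groups)

# Median, mean and size per group show the direction of the difference.
summary = toy.groupby("season")["count"].agg(["median", "mean", "size"])

print(f"H statistic: {stat:.2f}, p-value: {p_value:.4f}")
print(summary)
```

On the notebook's data, `df.groupby('weather')['count'].agg(['median', 'mean', 'size'])` and the same call for `season` would show which weather conditions and seasons have the higher rentals.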

# **INSIGHTS**

1.The mean count of cycles rented is 191, and the 50th percentile is 145.

2.The mean number of registered users is 155 and the mean number of casual users is 36. Thus, more cycles are rented by registered users.

3.More cycles are rented in summer and fall than in the other seasons.

4.More cycles are rented during holidays.

5.The two-sample t-test above did not find a statistically significant difference in rentals between working and non-working days.

6.Most of the cycles are rented when the weather is Clear, Few clouds or partly cloudy.

7.Very few cycles are rented during rain, thunderstorm, snow or fog.

8.There is a negative correlation between count and humidity and a weak positive correlation between count and windspeed. This means that more cycles are rented when humidity is low to medium and when windspeed is average.

9.There is a medium positive correlation between count and temp, and between count and atemp. So, more cycles are rented during moderate temperatures.

10.Also, very few cycles are rented during low temperatures.

# **RECOMMENDATIONS**

1.Promotion During Summer and Fall:

As more cycles are rented during summer and fall, promotional campaigns should be conducted or special offers should be given during these seasons to attract more users.

2.Target Registered Users:

Since more cycles are rented by registered users, the focus should be on loyalty programs that encourage user registration and usage.

3.Holiday and Non-Working Day Promotions:

To capture demand on holidays and non-working days, invest more in promotions or events on those days.

4.Weather-Based Marketing:

Marketing efforts should be aligned with weather conditions. Special deals and discounts should be offered during adverse weather conditions to encourage usage.

5.Optimize Operations Based on Weather:

Adjust the number of available cycles based on weather forecasts. During favorable weather, an adequate supply of cycles should be available, and during adverse weather, the focus should be on alternative services.

6.Correlation with Humidity and Windspeed:

Since more cycles are rented when humidity is low, information or incentives should be provided for users to rent cycles during periods of moderate humidity. The weak positive correlation with windspeed can be leveraged by promoting cycling on days with average windspeed.

7.Temperature-Responsive Services:

As there is a medium positive correlation between cycle rentals and temperature, services and promotions should be tailored to moderate temperatures, which will attract more users.

8.Address Low-Temperature Challenges:

Ways to promote cycling during low temperatures should be investigated, such as seasonal promotions, providing appropriate gear, or creating events that make cycling appealing even in colder weather.