In [1]:
import warnings
warnings.filterwarnings('ignore')

> # **About Yulu: 🚲**


- Founded in 2017, Yulu aims to transform urban communities.

- Yulu is India’s leading micro-mobility service provider, which offers unique vehicles for the daily commute. Starting off as a mission to eliminate traffic congestion in India, Yulu provides the safest commute solution through a user-friendly mobile app to enable shared, solo and sustainable commuting.

- Yulu zones are located at all the appropriate locations (including metro stations, bus stands, office spaces, residential areas, corporate offices, etc) to make those first and last miles smooth, affordable, and convenient!

- Yulu has recently suffered considerable dips in its revenues.

- They have contracted a consulting company to understand the factors on which the demand for these shared electric cycles depends in the Indian market.

># **Business Problem:**

- Which variables are significant in predicting the demand for shared electric cycles in the Indian market?

- How well do those variables describe the demand for electric cycles?

> # 📃 **Features of the dataset:**

 **Column Profiling:**

| Feature | Description |
|:--------|:------------|
|datetime| datetime|  
|season| season (1: spring, 2: summer, 3: fall, 4: winter)|
|holiday| whether day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)|
|workingday| if day is neither weekend nor holiday is 1, otherwise is 0.|
|temp| temperature in Celsius|
|atemp| feeling temperature in Celsius|
|humidity| humidity|
|windspeed| wind speed|
|casual| count of casual users|
|registered| count of registered users|
|count - Total_riders| count of total rental bikes including both casual and registered|

- weather : Feature with multiple categories, mentioned below:

|Category|Details|
|:------|:--------|
|1| Clear, Few clouds, Partly cloudy|
|2| Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist|
|3| Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds|
|4| Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog|
    


## Importing necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline
import seaborn as sns

from scipy.stats import norm,zscore,boxcox,probplot
from statsmodels.stats import weightstats as stests
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import ttest_ind,ttest_rel,ttest_1samp,mannwhitneyu
from scipy.stats import chisquare,chi2,chi2_contingency
from scipy.stats import f_oneway,kruskal,shapiro,levene,kstest
from scipy.stats import pearsonr,spearmanr
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler


In [None]:
yulu_data = pd.read_csv('Downloads/bike_sharing.csv')

# **Basic Analysis:**

> ### Make copy of data and do analysis: 

In [None]:
data = yulu_data.copy()

> ### Shape of the data

In [None]:
print('Number of Observations in yulu data:', data.shape[0])
print('Number of Features in yulu data:', data.shape[1])

>### Data-type of all attributes or concise summary

In [None]:
data.info()

**Inference:**

* There are no missing values in the dataframe.
* All the columns except datetime are numerical columns.

* The 'datetime' attribute is of object type; we should convert it to datetime type.
* Several categorical columns (season, holiday, workingday, weather) are stored as integers; we should convert them to categorical type.

In [None]:
data.duplicated().sum()

#### Insights:

- There are 10886 rows and 12 columns in our data.
- There are no null values.
- There are also no duplicate values.
- The column "datetime" is of object datatype, which needs to be corrected.
- The columns "season", "holiday", "workingday", "weather", "humidity", "casual", "registered" and "count" are in int datatype.
- The columns "temp", "atemp", and "windspeed" are in float datatype.

> ### Columns in a Dataframe

In [None]:
data.columns

## **Skewness Analysis**

In [None]:
data.skew(numeric_only = True)

#### Insights:

* Near-zero skewness:
    - Variables such as season and temp have skewness values close to zero, indicating relatively symmetrical distributions.

* Positive skewness:
    - Variables such as holiday, weather, windspeed, casual, registered and count have positive skewness, pointing to a concentration of lower values and a right-skewed distribution.

* Negative skewness:
    - Variables such as workingday, atemp and humidity have negative skewness, implying a concentration of higher values and a left-skewed distribution.
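To make the sign convention concrete, here is a minimal sketch on synthetic data (not the Yulu dataset). The column names and distributions are made up purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns, chosen only to illustrate the sign of the skewness
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=20, scale=5, size=5000),    # bell-shaped
    "right_skewed": rng.exponential(scale=50, size=5000),   # long right tail
})
demo["left_skewed"] = -demo["right_skewed"]                 # mirrored: long left tail

# Same call as in the notebook, applied to the synthetic frame
skewness = demo.skew(numeric_only=True)
print(skewness.round(2))
```

A value near zero points to a roughly symmetric column, a clearly positive value to a long right tail, and a clearly negative value to a long left tail.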


> ### **Statistical Summary:**

In [None]:
data.describe(include= 'object').T

### Statistical Summary of Numeric columns

In [None]:
data.describe().T

#### Insights:

* We don't see a significant difference between the mean and median for the independent variables temp, atemp, humidity and windspeed.
* This suggests that outliers and skewness play a smaller role for these variables, but we can't be sure. We will check further for outliers using box plots.
* It's difficult to comment on the distribution of the dependent variables, the counts of casual and registered users of the shared mobility service.
* We will revisit it through visual plots.

## Exploratory data analysis:

> ### Non-Graphical Analysis: Value counts and unique attributes

In [None]:
data.nunique()

> ### Value Counts in each features:

In [None]:
for col in data.columns:
    value_count=data[col].value_counts(normalize=True)*100
    print(f"----Value counts of {col} column ---- ")
    print()
    print(value_count.round(2))
    print()
    print()

# Data Preprocessing:

> ### Handling missing values

In [None]:
data.isna().sum()

In [None]:
plt.figure(figsize=(20,6))

plt.style.use('dark_background')
sns.heatmap(data.isnull(), cmap='Purples')
plt.title('Visual Check for Null value', fontsize = 20, color = 'green')
plt.show()

**Inference:**

* There are no missing values in the given dataset.

> ### Data-type conversion: 

In [None]:
# Converting datetime column to type datetime from object type:
data['datetime'] = pd.to_datetime(data.datetime)

In [None]:
# Converting season column from numerical to categorical:
def season_type(x):
    if x==1:
        return 'spring'
    elif x==2:
        return 'summer'
    elif x==3:
        return 'fall'
    else:
        return 'winter'

data['season'] = data['season'].apply(lambda x:season_type(x))

In [None]:
# Converting season,weather,holiday and workingday columns into categorical:
data['season']= pd.Categorical(data['season'])
data['weather']=pd.Categorical(data['weather'])
data['holiday']=pd.Categorical(data['holiday'])
data['workingday']=pd.Categorical(data['workingday'])
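As an alternative sketch for the code-to-label conversion, `Series.map` with a dictionary keeps unexpected codes visible as missing values, whereas an `else` branch would label them 'winter'. The mini-series below is hypothetical and only stands in for the raw integer season codes.

```python
import pandas as pd

# Hypothetical mini-series standing in for the raw integer season codes
codes = pd.Series([1, 2, 3, 4, 2, 9])   # 9 is a deliberately invalid code

season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

# Series.map leaves unknown codes as NaN instead of silently assigning a label
mapped = codes.map(season_labels)

# An explicit category list fixes the order in which the categories are shown
seasons = pd.Categorical(mapped, categories=["spring", "summer", "fall", "winter"])
print(seasons)
print("Unmapped values:", int(mapped.isna().sum()))
```

Whether invalid codes should surface as missing values or raise an error is a design choice; the point of the sketch is only that they do not disappear.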

In [None]:
# Statistical Summary after data-type conversion
print('Statistical Summary of categorical col:',data.describe(include='category').T)
print()
print('-'*100)
print()
print('Statistical Summary of all col:',data.describe().T)

> ### Handling outliers

* Outlier detection using the z-score method

* One method to detect outliers in a numeric column is the z-score.
* If the absolute z-score of a data point is more than 3, the data point is quite different from the other data points. Such a data point can be an outlier.

**z-score = (x - mean) / standard deviation**

In [None]:
outliers={}
for col in data.select_dtypes(include = np.number):
    #finding z-score for each value in a column
    z_score= np.abs((data[col]-data[col].mean()))/data[col].std()
    
    # A value whose absolute z-score is greater than 3 is treated as an outlier
    column_outliers = data[z_score > 3][col]
    
    outliers[col] = column_outliers
    
for col,outlier_values in outliers.items():
    print(f"Outliers for {col} column")
    print(outlier_values)
    print('-'*50)
    print()

**Inferences:**

* No outliers are present in the 'temp' and 'atemp' columns.
* Outliers are evident in the 'humidity' and 'windspeed' columns.
* Outliers are noticeable in the counts of casual and registered users, though drawing definite conclusions requires analyzing their relationship with the independent variables.

# Univariate Analysis:

> ### Plot Boxplot for detecting Outliers:

In [None]:
plt.style.use('ggplot')
num_col = data.select_dtypes(include=np.number)
# Box plot:
for col in num_col:
    plt.subplot()
    sns.boxplot(data = data, x = col)
    plt.title(f'Distribution of {col}',fontname='Franklin Gothic Medium', fontsize=15)
    plt.grid(True)
    plt.show()

**Inference:**

* No outliers are detected in the 'temp' and 'atemp' columns, suggesting that the temperature-related data points fall within the expected range.
* In the 'humidity' column, a single value is identified as an outlier, implying an unusual humidity measurement distinct from the others.
* The 'windspeed' column contains 12 outlier values, indicating instances where wind speed measurements significantly deviate from the typical range.
* The box plot clearly indicates the presence of outliers in the number of casual and registered users.
* The box plots reveal skewness in the data.
* As we proceed, we will decide whether to address outliers or perform variable transformation. In this case, given the significant number of outliers, variable transformation, specifically Log Transformation, seems to be a more appropriate approach.
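The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default. The sketch below applies that rule to synthetic right-skewed counts before and after a log transformation. The helper `iqr_outliers` and the lognormal data are assumptions made for illustration, not part of the Yulu data.

```python
import numpy as np
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

rng = np.random.default_rng(1)
# Synthetic right-skewed counts (always >= 1), standing in for a column like 'count'
counts = pd.Series(np.round(rng.lognormal(mean=4.5, sigma=1.0, size=2000)) + 1)

raw_outliers = iqr_outliers(counts)
log_outliers = iqr_outliers(np.log(counts))

print("Outliers flagged on the raw scale:", len(raw_outliers))
print("Outliers flagged on the log scale:", len(log_outliers))
```

On data shaped like this, most of the flagged points come from the long right tail, which is why a log transformation tends to reduce their number.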

> ### Distribution of Working day:

In [None]:
plt.style.use('default')
workingday_df = data.groupby(['workingday']).agg(number_of_cycles_rented=('workingday', 'count')).reset_index()
print(workingday_df)
print()

labels = workingday_df['workingday']
values = workingday_df['number_of_cycles_rented']

plt.figure(figsize=(10, 4))

# Bar plot
plt.subplot(121)
sns.countplot(data=data, x='workingday', palette='viridis')
plt.title('Count of Cycles Rented by Working Day',fontname='Franklin Gothic Medium', fontsize=15)

# Pie chart
plt.subplot(122)
plt.pie(x=values, labels=labels, autopct='%1.1f%%', colors=sns.color_palette('viridis', len(labels)))
plt.title('Proportion of Cycles Rented by Working Day',fontname='Franklin Gothic Medium', fontsize=15)

plt.tight_layout()
plt.show()


**Insight:**

* Working days account for 68.1% of the hourly records and non-working days for 31.9%. These percentages count rows in the dataset, not the number of cycles rented.
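Because the cell above counts rows per group, its percentages are shares of records rather than shares of rentals. A minimal sketch on a made-up frame (same column names, invented values) shows how the two can differ:

```python
import pandas as pd

# Tiny made-up frame reusing the notebook's column names
toy = pd.DataFrame({
    "workingday": [1, 1, 1, 0],
    "count":      [10, 20, 30, 140],
})

summary = toy.groupby("workingday").agg(
    n_records=("count", "size"),      # how many rows fall in each group
    total_rentals=("count", "sum"),   # how many cycles were actually rented
)
summary["share_of_records_%"] = 100 * summary["n_records"] / summary["n_records"].sum()
summary["share_of_rentals_%"] = 100 * summary["total_rentals"] / summary["total_rentals"].sum()
print(summary)
```

In this toy frame working days hold 75% of the rows but only 30% of the rentals, so the aggregation function should match the question being asked.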

> ### Distribution of Season:

In [None]:
data.columns

In [None]:
season_df = data.groupby(['season']).agg(number_of_cycles_rented = ('season', 'count')).reset_index()
print(season_df)
print()

labels = season_df['season']
values = season_df['number_of_cycles_rented']

plt.figure(figsize=(10, 4))

# Bar plot
plt.subplot(121)
sns.countplot(data = data, x='season', palette='viridis')
plt.title('Count of Cycles Rented by Season',fontname='Franklin Gothic Medium', fontsize=15)

# Pie chart
plt.subplot(122)
plt.pie(x = values, labels=labels, autopct='%1.1f%%', colors = sns.color_palette('viridis', len(labels)))
plt.title('Proportion of Cycles Rented by each season',fontname='Franklin Gothic Medium', fontsize=15)

plt.tight_layout()
plt.show()


**Insights:**
* The fall, summer and winter seasons each account for approximately 25.1% of the records (these are shares of rows, not of cycles rented).
* Spring has the smallest share of records, at 24.7%.

> ### Distribution of Weather

In [None]:
weather_df = data.groupby(['weather']).agg(number_of_cycles_rented = ('weather', 'count')).reset_index()
print(weather_df)
print()

labels = weather_df['weather']
values = weather_df['number_of_cycles_rented']

plt.figure(figsize=(12, 4))

# Bar plot
plt.subplot(121)
sns.countplot(data = data, x='weather', palette='viridis')
plt.title('Count of Cycles Rented as per weather condition',fontname='Franklin Gothic Medium', fontsize=15)

# Pie chart
plt.subplot(122)
plt.pie(x = values, labels=labels, autopct='%1.1f%%', colors = sns.color_palette('viridis', len(labels)))
plt.title('Proportion of Cycles Rented as per weather condition',fontname='Franklin Gothic Medium', fontsize=15)

plt.tight_layout()
plt.show()


**Insights:**

* Weather condition 1 is the most common, accounting for approximately 66% of the records.
* Weather condition 2 accounts for around 26.0% of the records.
* Weather condition 3 accounts for approximately 7.9% of the records.
* Weather condition 4 is exceptionally rare: it appears in only 1 record, which rounds to 0.0%.

> ### Hourly Trends in Average Cycle Rentals:

In [None]:
hour_df =data.groupby(data['datetime'].dt.hour).agg(average_cycles_rented = ('count','mean')).reset_index()
hour_df.head()

In [None]:
plt.figure(figsize=(10, 5))
sns.lineplot(data=hour_df, x='datetime', y='average_cycles_rented', marker='o', color='green')
plt.title('Average Cycles Rented Per Hour', fontsize=16, fontweight='bold')
plt.xlabel('Hour of the Day', fontsize=14)
plt.ylabel('Average Cycles Rented', fontsize=14)
plt.grid(visible=True, linestyle='--', linewidth=0.5)
plt.xticks(range(0, 24))  # Assuming hours are 0-23
plt.yticks()
plt.show()

**Insights:**

* The highest average count of rental bikes is observed at 5 PM, closely followed by 6 PM and 8 AM. This indicates distinct peak hours during the day when cycling is most popular.
* Conversely, the lowest average count of rental bikes occurs at 4 AM, with 3 AM and 5 AM also showing low counts. These hours represent the early morning period with the least demand for cycling.
* Notably, there is an increasing trend in cycle rentals between 5 AM and 8 AM, suggesting a surge in demand during the early morning hours as people start their day.
* Additionally, there is a decreasing trend in cycle rentals from 5 PM to 11 PM, indicating a gradual decline in demand as the day progresses into the evening and nighttime.

> ### Monthly trend in average cycle rentals

In [None]:
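# Group by month number as well as month name so the rows come out in calendar order instead of alphabetical order
month_df = data.groupby([data['datetime'].dt.month.rename('month_number'), data['datetime'].dt.month_name()]).agg(average_cycles_rented = ('count','mean')).reset_index()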
month_df.head()

In [None]:
plt.figure(figsize=(10, 5))
sns.lineplot(data = month_df, x ='datetime', y ='average_cycles_rented', marker='o', color='cyan')
plt.title('Average Cycles Rented on monthly basis', fontsize=16, fontweight='bold')
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Cycles Rented', fontsize=14)
plt.grid(visible=True, linestyle='--', linewidth=0.5)
plt.xticks(rotation = 90)  # Rotate the month labels for readability
plt.yticks()
plt.show()

**Insights:**

* The highest average hourly count of rental bikes occurs in June, July, and August, reflecting the peak demand during the summer season.
* Conversely, the lowest average hourly count of rental bikes is found in January, February, and March, which are the winter months with reduced cycling activity.
* Notably, there is an increasing trend in average bike rentals from February to June, corresponding to the shift from winter to spring and summer.
* Conversely, a decreasing trend in average bike rentals is observed from October to December due to the onset of winter.

# Bivariate Analysis:
> ### Distribution of count of rented bikes across working day

In [None]:
plt.figure(figsize=(15,5))
# KDE plot
plt.subplot(1,2,1)
sns.set_style('darkgrid')
sns.kdeplot(data=data,x='count',hue='workingday',palette = sns.color_palette("dark"))
plt.xlabel('Number of rented bikes')
plt.ylabel('Probability Density')

# Box plot
plt.subplot(1,2,2)
sns.boxplot(data=data,y='count',x='workingday',palette = sns.color_palette("dark"))
plt.xlabel('Working day')
plt.ylabel('Number of bikes rented')

plt.suptitle('Distribution of number of rented bikes across Working Day')
plt.show()

> ### Distribution of count of rented bikes across Season

In [None]:
plt.figure(figsize=(15,5))
plt.suptitle('Distribution of number of rented bikes across Season')
# KDE plot
plt.subplot(1,2,1)
sns.set_style('darkgrid')
sns.kdeplot(data = data,x = 'count',hue='season', palette = sns.color_palette("tab10"))
plt.xlabel('Number of rented bikes')
plt.ylabel('Probability Density')

# Box plot
plt.subplot(1,2,2)
sns.boxplot(data=data,y='count',x = 'season',palette=sns.color_palette("tab10"))
plt.xlabel('Season')
plt.ylabel('Number of bikes rented')

plt.show()

>### Distribution of count of rented bikes across Weather types

In [None]:
plt.figure(figsize=(15,5))
plt.suptitle('Distribution of number of rented bikes across Weather types')
# KDE plot
plt.subplot(1,2,1)
sns.set_style('darkgrid')
sns.kdeplot(data=data,x='count',hue='weather',palette=sns.color_palette("dark"))
plt.xlabel('Number of rented bikes')
plt.ylabel('Probability Density')

# Box plot
plt.subplot(1,2,2)
sns.boxplot(data=data,y='count',x='weather',palette=sns.color_palette("dark"))
plt.xlabel('Weather type')
plt.ylabel('Number of bikes rented')

plt.show()

# Heatmap

In [None]:
data.columns

In [None]:

sns.heatmap(data[['temp','atemp','humidity','windspeed','registered','count','casual']].corr(),annot=True,cmap='rocket',fmt='.2f')
plt.show()

**Insights:**

* The positive correlation of 0.39 between temperature and the number of bikes rented indicates that rentals tend to rise as the temperature rises, although the relationship is only weak to moderate. A similar correlation is observed for the "feels-like" temperature (atemp). A correlation coefficient alone says nothing about the time of day, so it should not be read as evidence about midday demand.
* The negative correlation between humidity and the number of cycles rented indicates that rentals tend to be lower when humidity is high. Rider discomfort in humid conditions is a plausible explanation, but the data here do not establish the cause.
* The weak positive correlation between windspeed and the number of cycles rented shows only a slight tendency for rentals to be higher at higher wind speeds. The effect is too weak to support conclusions about rider preferences, and it may be confounded with other variables such as time of day or season.
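A heatmap shows the size of each coefficient but not whether it is distinguishable from zero. As a hedged sketch, `pearsonr` and `spearmanr` (both imported at the top of the notebook) return a coefficient together with a p-value. The arrays below are synthetic stand-ins, not the Yulu columns.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
n = 1000

# Synthetic stand-ins: 'temp' drives 'rentals' weakly, 'unrelated' does not
temp = rng.normal(20, 8, size=n)
rentals = 5 * temp + rng.normal(0, 100, size=n)
unrelated = rng.normal(0, 1, size=n)

results = {}
for name, x in [("temp", temp), ("unrelated", unrelated)]:
    r, p_r = pearsonr(x, rentals)       # linear association
    rho, p_rho = spearmanr(x, rentals)  # monotonic (rank-based) association
    results[name] = {"pearson_r": r, "pearson_p": p_r,
                     "spearman_rho": rho, "spearman_p": p_rho}
    print(f"{name:>9}: Pearson r={r:.2f} (p={p_r:.3g}), "
          f"Spearman rho={rho:.2f} (p={p_rho:.3g})")
```

Even a statistically significant coefficient describes association only; it does not explain why riders behave as they do.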

# Hypothesis Testing:
>## Does working day have an effect on the number of bikes rented?

In [None]:
# import necessary library for hypothesis test
from scipy.stats import chi2 # Distribution (cdf etc.)
from scipy.stats import chisquare # Perform a chi-square test for goodness of fit.
from scipy.stats import chi2_contingency # Categorical Vs Categorical
from scipy.stats import ttest_rel,ttest_1samp,ttest_ind # Perform a paired (dependent),Perform a one-sample t-test, Perform an independent (two-sample) t-test.
from scipy.stats import binom # Access binomial distribution functions (e.g., PMF, CDF).

'''
F-Test and ANOVA
f: Access the F-distribution functions.
f_oneway: Perform a one-way ANOVA test.
kruskal: Perform the Kruskal-Wallis H-test for independent samples.
levene: Test for equality of variances across samples.

Normality Tests
shapiro: Perform the Shapiro-Wilk test for normality.
kstest: Perform the Kolmogorov-Smirnov test for goodness of fit.
norm: Access normal distribution functions (e.g., PDF, CDF).
'''

from scipy.stats import f,f_oneway,kruskal,ttest_ind,levene,shapiro,kstest,norm
from statsmodels.distributions.empirical_distribution import ECDF
from statsmodels.graphics.gofplots import qqplot # Create a quantile-quantile plot to compare distributions


**Formulating Null and Alternative Hypotheses**

To answer the above question we first set up Null and Alternate Hypothesis:

**H0: WORKING DAY HAS NO EFFECT ON THE NUMBER OF CYCLES RENTED**

**H1: WORKING DAY HAS AN EFFECT ON THE NUMBER OF CYCLES RENTED**

**Solution: To test the above hypothesis, we use Two sample T Test**

## **Assumptions of a T Test**
1. Independence : The observations in one sample are independent of the observations in the other sample.
2. Normality : Both samples are approximately normally distributed.
3. Homogeneity of Variances : Both samples have approximately the same variance.
4. Random Sampling : Both samples were obtained using random sampling method.

### **Normality Check: Shapiro-Wilk Test**
* To conduct the test, we take the samples randomly; the numbers of electric cycles rented on working days and non-working days are also independent of each other.
* We still have to check for normality and homogeneity of variances.
* Generate samples of 300 bike rental counts each, randomly selected from working days and from non-working days.

In [None]:
workingday_sample = data[data['workingday']==1]['count'].sample(300)
nonworkingday_sample = data[data['workingday']==0]['count'].sample(300)

In [None]:
# To check normality we can use histogram or QQplot

In [None]:
from statsmodels.graphics.gofplots import qqplot
# Data subsets
nonworking_counts = data[data['workingday'] == 0]['count'].values
working_counts = data[data['workingday'] == 1]['count'].values

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot for working day = 0
qqplot(nonworking_counts, line='s', ax=axes[0])
axes[0].set_title("Q-Q Plot: Non-Working Day")

# Q-Q plot for working day = 1
qqplot(working_counts, line='s', ax=axes[1])
axes[1].set_title("Q-Q Plot: Working Day")

# Show plots
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(15,4))

#histogram for working day sample
plt.subplot(1,2,1)
sns.histplot(workingday_sample,kde=True,color='mediumseagreen')
plt.title('Working Day')

#histogram for non working day sample
plt.subplot(1,2,2)
sns.histplot(nonworkingday_sample,kde=True,color='mediumseagreen')
plt.title('Non working day')

plt.suptitle('Distribution of number of rented bikes')
plt.show()

**Inference:**

* The counts of rented cycles on both working and non-working days do not follow a normal distribution.
* We can try to convert the distribution to normal by applying log transformation

In [None]:
#Converting the sample distribution to normal using log transformation method:
plt.figure(figsize=(15,4))

#histogram for working day sample
plt.subplot(1,2,1)
sns.histplot(np.log(workingday_sample),kde = True,color ='mediumseagreen')
plt.title('Working day')

#histogram for non working day sample
plt.subplot(1,2,2)
sns.histplot(np.log(nonworkingday_sample),kde=True,color='mediumseagreen')
plt.title('Non working day')

plt.suptitle('Distribution of number of rented bikes')
plt.show()

**Inference:** 
* After applying a log transformation to the counts, the distributions of both workingday_sample and nonworkingday_sample look considerably closer to a normal curve.

### **We will now conduct the Shapiro-Wilk test to assess the normality of the log-transformed samples obtained in the previous step**

### Performing the Shapiro-Wilk test for the working day sample

Let's assume the significance level to be 5%. The null and alternate hypotheses are as follows:

**H0 : The Working day samples are normally distributed**

**Ha: The Working day samples are not normally distributed**

In [None]:
test_stat, p_value = shapiro(np.log(workingday_sample))
print("test stat :",test_stat)
print("p value :", p_value)
alpha = 0.05 # Significance level
if p_value < alpha:
 print("Interpretation: Reject Ho: The working day samples are not normally distributed.")
else:
 print("Interpretation: Fail to Reject Ho: The working day samples are normally distributed")

### Performing the Shapiro-Wilk test for the non-working day sample

We select the level of significance as 5%. The null and alternate hypotheses are as follows:

**H0 : The non working day samples are normally distributed.**

**Ha: The non working day samples are not normally distributed.**

In [None]:
test_stat, p_value = shapiro(np.log(nonworkingday_sample))
print("test stat :",test_stat)
print("p value :", p_value)
alpha = 0.05 # Significance level
if p_value < alpha:
 print("Interpretation: Reject Ho: The Non-working day samples are not normally distributed.")
else:
 print("Interpretation: Fail to Reject Ho: The Non-working day samples are normally distributed")

**Inference:**

**Working day**
* From the above output, we see that the p-value is far less than 0.05; hence we reject the null hypothesis.
* We have sufficient evidence to say that the log-transformed working day sample does not come from a normal distribution.
 
**Non-working day**
* From the above output, we see that the p-value is far less than 0.05; hence we reject the null hypothesis.
* We have sufficient evidence to say that the log-transformed non-working day sample does not come from a normal distribution.
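A log transformation does not guarantee normality; whether it helps depends on the shape of the original data. The sketch below runs `shapiro` on synthetic samples of size 300 (the distributions are assumptions chosen for illustration) to show both outcomes.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(3)

# Synthetic samples of the same size as the notebook's samples (300)
skewed = rng.exponential(scale=100, size=300) + 1        # strongly right-skewed
log_normal = rng.lognormal(mean=4, sigma=1, size=300)    # exactly normal after taking logs

p_values = {}
for label, sample in [("exponential (raw)", skewed),
                      ("exponential (log)", np.log(skewed)),
                      ("lognormal (log)", np.log(log_normal))]:
    stat, p = shapiro(sample)
    p_values[label] = p
    print(f"{label:<18} W={stat:.3f}  p={p:.3g}")
```

The exponential sample is rejected both before and after the log transformation, because its logarithm is left-skewed rather than normal. A histogram that merely looks more bell-shaped is therefore not enough, which is why the formal test is run.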

### **Homogeneity of Variance test: Levene's Test**
We select the level of significance as 5%. The null and alternate hypotheses are as follows:

**H0: The variance is equal in the working day and non-working day count samples**

**Ha: The variances are not equal**

In [None]:
test_stat,p_value = levene(np.log(workingday_sample),np.log(nonworkingday_sample))
print("test stat :",test_stat)
print("p value :",p_value)
alpha = 0.05
if p_value< alpha:
 print("Interpretation: Reject Ho: Variances are not equal")
else:
 print("Interpretation: Fail to Reject Ho that Variance is equal in both working day count and non working day count samples")

**Inference:**

* Since the p-value is not less than 0.05, we fail to reject the null hypothesis.
* We therefore do not have sufficient evidence that the variances of the working day and non-working day counts differ, so the assumption of homogeneity of variances is treated as satisfied.
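To see how Levene's test reacts to equal and unequal spreads, here is a small sketch on synthetic normal samples. The group names and standard deviations are made up for illustration.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(4)

same_a = rng.normal(0, 1, size=300)
same_b = rng.normal(0, 1, size=300)   # same standard deviation as same_a
wide = rng.normal(0, 3, size=300)     # three times the standard deviation

# center='median' is the variant that is more robust to skewed data
stat_eq, p_eq = levene(same_a, same_b, center="median")
stat_ne, p_ne = levene(same_a, wide, center="median")

print(f"Equal spreads:   statistic={stat_eq:.3f}, p={p_eq:.3g}")
print(f"Unequal spreads: statistic={stat_ne:.3f}, p={p_ne:.3g}")
```

A small p-value is evidence that the variances differ; a large p-value only means that no difference was detected at this sample size.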

## TTEST ASSUMPTION CONCLUSION:

* 3 out of 4 assumptions for the t-test have been satisfied.
* Although the samples did not pass the normality test, we proceed with the t-test as per the given instructions.

### For the t-test we select the level of significance as 5%. The null and alternate hypotheses are defined as follows:

**H0 : Working day does not have an effect on number of cycles rented**

**Ha: Working day does have an effect on number of cycles rented**

In [None]:
t_stat,p_value = ttest_ind(np.log(workingday_sample), np.log(nonworkingday_sample),equal_var=True)
print("t stat :", t_stat)
print("p value :", p_value)
alpha = 0.05
if p_value< alpha:
 print("Reject Ho: Working day does have an effect on number of cycles rented ")
else:
 print("Fail to Reject Ho: Working day does not have an effect on number of cycles rented")

**Final Conclusion:**

* Since the p-value of our test is greater than alpha (0.05), we fail to reject the null hypothesis.
* We do not have sufficient evidence to conclude that working days have a significant effect on the number of cycles rented.
* In other words, the samples show no statistically significant difference in the number of cycles rented between working days and non-working days. This is not proof that the two are equal.
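Since the normality assumption was not met, a rank-based cross-check can be useful. The sketch below compares Welch's t-test with the Mann-Whitney U test (`mannwhitneyu` is already imported at the top of the notebook) on synthetic skewed samples. This is a swapped-in illustration, not the analysis the notebook performs, and the group names and parameters are assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(5)

# Synthetic right-skewed samples; group_b is shifted upwards on purpose
group_a = rng.lognormal(mean=4.0, sigma=1.0, size=300)
group_b = rng.lognormal(mean=4.5, sigma=1.0, size=300)

# Welch's t-test: does not assume equal variances
t_stat, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, does not assume normality
u_stat, p_mwu = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test:   t={t_stat:.3f}, p={p_welch:.3g}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_mwu:.3g}")
```

If the parametric and non-parametric tests agree, the conclusion is less dependent on the violated assumption.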

>## Is the number of cycles rented similar or different across seasons?

**To perform such an analysis, we will use the ANOVA test.**

* ANOVA, which stands for Analysis of Variance, is a statistical technique used to assess whether there is a statistically significant difference among the means of two or more groups defined by a categorical variable.
* It does this by comparing the variance between the groups with the variance within the groups.
* Here we have 4 different seasons, namely spring, summer, fall and winter.
* With the help of the ANOVA test, we can find out whether the mean number of cycles rented is the same or different across the seasons.
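The comparison of between-group and within-group variance can be sketched by computing the F statistic by hand and checking it against `f_oneway`. The three groups below are synthetic and their parameters are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(6)
# Three synthetic groups; the third has a higher mean on purpose
groups = [rng.normal(10, 2, 50), rng.normal(10, 2, 50), rng.normal(12, 2, 50)]

k = len(groups)                           # number of groups
n_total = sum(len(g) for g in groups)     # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far the group means lie from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f.sf(f_manual, k - 1, n_total - k)   # upper tail of the F distribution

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual: F={f_manual:.4f}, p={p_manual:.3g}")
print(f"SciPy:  F={f_scipy:.4f}, p={p_scipy:.3g}")
```

A large F means the group means differ by more than the within-group scatter would explain.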

**Determine the Null and Alternative Hypotheses**
* We shall setup the Null and Alternate Hypothesis to check if there is any effect of season on the number of cycles rented.

**H0: All the 4 different seasons have equal means.**

**Ha: At least one season's mean differs significantly from the others.**

**Assumptions for ANOVA Test**
* To implement the one-way ANOVA test, we need to make sure that the data satisfy certain conditions:

- Data should be normally distributed (i.e, Gaussian)

- Data should be independent across each record

- Equal variance in different groups

In [None]:
data.season.value_counts()

In [None]:
winter_sample = data[data['season']=='winter']['count'].sample(500)
fall_sample = data[data['season']=='fall']['count'].sample(500)
summer_sample = data[data['season']=='summer']['count'].sample(500)
spring_sample = data[data['season']=='spring']['count'].sample(500)

### To check the Normality we can use histogram

In [None]:
plt.figure(figsize=(20,10))

#histogram for winter season 
plt.subplot(2,2,1)
sns.histplot(winter_sample, kde=True, color='cornflowerblue')
plt.title('Winter Season')

#histogram for fall season 
plt.subplot(2,2,2)
sns.histplot(fall_sample,kde=True,color='hotpink')
plt.title('Fall Season')

#histogram for summer season 
plt.subplot(2,2,3)
sns.histplot(summer_sample,kde=True,color='goldenrod')
plt.title('Summer Season')

#histogram for spring season 
plt.subplot(2,2,4)
sns.histplot(spring_sample,kde=True,color='forestgreen')
plt.title('Spring Season')

plt.suptitle('Distribution of number of rented bikes across different seasons',fontname='Franklin Gothic Medium', fontsize=15)
plt.show()

**Inference:**

* We see that none of the distributions is normal. Hence we apply a log transformation to bring these distributions closer to normal.



In [None]:
log_winter = np.log(winter_sample)
log_fall = np.log(fall_sample)
log_summer = np.log(summer_sample)
log_spring=np.log(spring_sample)

In [None]:
plt.figure(figsize=(20,10))

#histogram for winter season 
plt.subplot(2,2,1)
sns.histplot(log_winter, kde=True, color='cornflowerblue')
plt.title('Winter Season')

#histogram for fall season 
plt.subplot(2,2,2)
sns.histplot(log_fall,kde=True,color='hotpink')
plt.title('Fall Season')

#histogram for summer season 
plt.subplot(2,2,3)
sns.histplot(log_summer,kde=True,color='goldenrod')
plt.title('Summer Season')

#histogram for spring season 
plt.subplot(2,2,4)
sns.histplot(log_spring,kde=True,color='forestgreen')
plt.title('Spring Season')

plt.suptitle('Distribution of number of rented bikes across different seasons',fontname='Franklin Gothic Medium', fontsize=15)
plt.show()

**Inference:**

* After applying a log transformation to the samples of each season, the distributions for all four seasons look considerably closer to normal.
* We will now conduct the Shapiro-Wilk test to assess the normality of the log-transformed samples obtained in the previous step.


### Shapiro-Wilk Test for winter, fall, summer and spring season sample data

We select the level of significance as 5%. The null and alternate hypotheses are as follows:

**H0 : The sample follows a normal distribution**

**Ha: The sample does not follow a normal distribution**

In [None]:
alpha = 0.05  # Significance level
log_samples = {'winter': log_winter, 'fall': log_fall, 'summer': log_summer, 'spring': log_spring}

# Run the Shapiro-Wilk test on each log-transformed season sample
for i, (season_name, log_sample) in enumerate(log_samples.items()):
    if i > 0:
        print('-'*50)
    test_stat, p_value = shapiro(log_sample)
    print(f"Season: {season_name}")
    print("test stat :", test_stat)
    print("p value :", p_value)
    if p_value < alpha:
        print("Reject Ho: The sample does not follow a normal distribution")
    else:
        print("Fail to Reject Ho: The sample follows a normal distribution")

**Inference:**

* Even after the log transformation, the samples do not follow a normal distribution according to the Shapiro-Wilk test.

### Homogeneity of Variance test: Levene's Test
We set the significance level at 5%. The null and alternative hypotheses are:

**H0 : The variance is equal across all groups**

**Ha : The variance is not equal across the groups**

In [None]:
test_stat,p_value= levene(log_winter, log_fall, log_summer, log_spring, center = 'median')
print("test stat :",test_stat)
print("p value :", p_value)
alpha = 0.05
if p_value< alpha:
 print("Reject Ho: Variance is not equal across the groups ")
else:
 print("Fail to Reject Ho: Variance is equal across all groups")

**Inference:**

* Since the p-value is less than 0.05, we reject the null hypothesis.
* This means we have sufficient evidence that the variance differs across the seasons.
* Therefore, the homogeneity-of-variance assumption, which requires equal variances, is not satisfied.
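
To make the mechanics of Levene's test more concrete, the sketch below shows that the statistic with `center='median'` matches a one-way ANOVA run on each observation's absolute deviation from its group median. The three groups are synthetic (an assumption, not the seasonal samples), and `g1`, `g2`, `g3` are hypothetical names.

```python
import numpy as np
from scipy.stats import levene, f_oneway

rng = np.random.default_rng(1)
# Three synthetic groups with the same mean but different spreads (illustrative only).
g1 = rng.normal(5.0, 1.0, 200)
g2 = rng.normal(5.0, 1.5, 200)
g3 = rng.normal(5.0, 2.0, 200)

lev_stat, lev_p = levene(g1, g2, g3, center='median')

# Absolute deviation of every observation from its own group median.
deviations = [np.abs(g - np.median(g)) for g in (g1, g2, g3)]
f_stat, f_p = f_oneway(*deviations)

print("Levene statistic      :", lev_stat)
print("ANOVA on |x - median| :", f_stat)
```

Seen this way, a small p-value says the typical distance from the group median differs between groups, which is what unequal variance looks like.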

**ANOVA Test and Final Conclusion**
* Only 1 of the 3 assumptions of the ANOVA test is satisfied.

* We nevertheless proceed with the test, so its result should be read with caution.

For the ANOVA test we set the significance level at 5%. The null and alternative hypotheses are:

**H0 : The mean number of cycles rented is the same across different seasons**

**Ha: At least one season has a mean number of cycles rented that is significantly different from the others.**

In [None]:
f_stat,p_value= f_oneway(log_winter,log_fall, log_summer,log_spring)
print("f stat :",f_stat)
print("p value :",p_value)
alpha = 0.05
if p_value< alpha:
 print("Reject Ho: At least one season has a mean number of cycles rented that is significantly different from the others.")
else:
 print("Fail to Reject Ho: The mean number of cycles rented is the same across different seasons ")

**Conclusion:**

* Since the p-value obtained from our test is less than the predetermined alpha level of 0.05, we have sufficient evidence to reject the null hypothesis for this test.
* This means that at least one season has a mean number of cycles rented that differs significantly from the others.
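
Since the normality and equal-variance checks above did not hold, a rank-based test is one possible cross-check. The sketch below uses the Kruskal-Wallis H-test (`scipy.stats.kruskal`), which is a different technique from the ANOVA used in this notebook. It runs on synthetic skewed data (an assumption), with hypothetical sample names.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(2)
# Synthetic skewed samples standing in for seasonal rental counts (assumed data).
spring = rng.lognormal(4.0, 1.0, 500)
summer = rng.lognormal(4.6, 1.0, 500)
fall = rng.lognormal(4.7, 1.0, 500)
winter = rng.lognormal(4.5, 1.0, 500)

h_stat, p_value = kruskal(spring, summer, fall, winter)

# The test works on ranks, so a monotone transform such as log leaves it unchanged.
h_log, p_log = kruskal(np.log(spring), np.log(summer), np.log(fall), np.log(winter))

print("H stat :", h_stat)
print("p value :", p_value)
```

Because the statistic depends only on ranks, no log transformation is needed before applying it.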

> ## Is the number of cycles rented similar or different in different weather conditions?

* To perform this analysis, we use the ANOVA test.

* ANOVA, which stands for Analysis of Variance, is a statistical technique used to assess whether there is a statistically significant difference among the means of two or more groups defined by a categorical variable.
* It does this by comparing the variance between the groups with the variance within them.

We need to check whether the number of bikes rented differs significantly across weather conditions.
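
The between-group versus within-group comparison can be worked through by hand. The sketch below computes the F-statistic from the two sums of squares and compares it with `scipy.stats.f_oneway`. The three small groups are made-up numbers for illustration, not values from the dataset.

```python
import numpy as np
from scipy.stats import f_oneway

# Made-up groups for illustration only.
groups = [
    np.array([12.0, 15.0, 11.0, 14.0]),
    np.array([18.0, 21.0, 19.0, 22.0]),
    np.array([10.0, 9.0, 13.0, 12.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = f_oneway(*groups)

print("manual F :", f_manual)
print("scipy F  :", f_scipy)
```

A large F means the group means are spread out relative to the noise inside each group, which is what leads to a small p-value.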

* Before running the one-way ANOVA test, we need to check that the data satisfy its assumptions:
### Assumptions of the ANOVA test:
* Data should be normally distributed (i.e., Gaussian)

* Data should be independent across each record

* Equal variance in different groups

In [None]:
data['weather'].value_counts()

* The 4 different weather conditions are as follows:

1. Clear, Few clouds, partly cloudy, partly cloudy
2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

### Formulate null and alternative hypotheses:
We set up the null and alternative hypotheses to check whether weather has any effect on the number of cycles rented. Weather type 4 has only one record in the dataset, so the test covers weather types 1, 2 and 3.

**H0 : The mean number of cycles rented is the same across all three weather types.**

**Ha : At least one weather type has a mean number of cycles rented that differs significantly from the others.**

In [None]:
# Generate a sample of 500 data points for each weather condition:
sample_1 = data[data['weather']==1]['count'].sample(500)
sample_2 = data[data['weather']==2]['count'].sample(500)
sample_3 = data[data['weather']==3]['count'].sample(500)
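
Because `.sample()` draws at random, the samples, and therefore the test results, can change between runs. A minimal sketch of a reproducible draw, assuming a small hypothetical frame `demo` in place of the notebook's `data`:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
demo = pd.DataFrame({
    'weather': [1] * 50 + [2] * 30 + [3] * 20,
    'count': range(100),
})

# Passing random_state fixes the draw, so the same rows come back on every run.
first_draw = demo[demo['weather'] == 1]['count'].sample(10, random_state=42)
second_draw = demo[demo['weather'] == 1]['count'].sample(10, random_state=42)

print(first_draw.equals(second_draw))
```

The seed value 42 is arbitrary; any fixed integer gives a repeatable sample.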

### To check assumption 1 (normality) we use histograms

In [None]:
plt.figure(figsize=(20,5))

#histogram for weather condition 1
plt.subplot(1,3,1)
sns.histplot(sample_1,kde = True,color='mediumseagreen')
plt.title('Weather type 1')

#histogram for weather condition 2 
plt.subplot(1,3,2)
sns.histplot(sample_2,kde=True,color='mediumseagreen')
plt.title('Weather type 2')

#histogram for weather condition 3 
plt.subplot(1,3,3)
sns.histplot(sample_3,kde=True,color='mediumseagreen')
plt.title('Weather type 3')

plt.suptitle('Distribution of number of rented bikes across different weather types',fontname='Franklin Gothic Medium', fontsize=15)
plt.show()

**Inference:**

None of the distributions appears normal, so we apply a log transformation to bring them closer to normality.

In [None]:
# Apply a log transformation to bring the sample distributions closer to normal

log_1 = np.log(sample_1)
log_2 = np.log(sample_2)
log_3 = np.log(sample_3)

In [None]:
plt.figure(figsize=(20,5))

#histogram for weather condition 1
plt.subplot(1,3,1)
sns.histplot(log_1,kde = True,color='khaki')
plt.title('Weather type 1')

#histogram for weather condition 2 
plt.subplot(1,3,2)
sns.histplot(log_2, kde=True,color='khaki')
plt.title('Weather type 2')

#histogram for weather condition 3 
plt.subplot(1,3,3)
sns.histplot(log_3,kde=True,color='khaki')
plt.title('Weather type 3')

plt.suptitle('Distribution of number of rented bikes across different weather types',fontname='Franklin Gothic Medium', fontsize=15)
plt.show()

**Inference:**

After applying a log transformation to the data for each weather type, the distributions resemble a normal distribution much more closely.

We now run the **Shapiro-Wilk Test** to check whether the log-transformed samples are in fact normally distributed.

### Shapiro-Wilk Test for weather type 1, 2 and 3 sample data:

We set the significance level at 5%. The null and alternative hypotheses are:

**H0 : The sample follows a normal distribution**

**Ha: The sample does not follow a normal distribution**

In [None]:
alpha = 0.05

# Run the Shapiro-Wilk test on each log-transformed weather sample
for label, sample in [('Weather Type 1', log_1),
                      ('Weather Type 2', log_2),
                      ('Weather Type 3', log_3)]:
    print(f'---------{label}---------')
    test_stat, p_value = shapiro(sample)
    print("test stat :", test_stat)
    print("p value :", p_value)
    if p_value < alpha:
        print("Reject Ho: The sample does not follow a normal distribution")
    else:
        print("Fail to Reject Ho: The sample follows a normal distribution")

**Inference:**
* Even after the log transformation, the samples do not follow a normal distribution according to the Shapiro-Wilk test.

**Final Conclusion:**

* None of the weather type samples follows a normal distribution even after the log transformation, so the normality assumption of the ANOVA test is not met.

### Homogeneity of Variance test: Levene's Test (Assumption 3)

We set the significance level at 5%. The null and alternative hypotheses are:

**H0 : The variance is equal across all groups**

**Ha : The variance is not equal across the groups**

In [None]:
test_stat, p_value = levene(log_1,log_2,log_3)
print('test_stat:',test_stat)
print('p_value:',p_value)
alpha = 0.05
if p_value < alpha:
    print('Reject H0, The variance is not equal across the groups')
else:
    print('Fail to Reject H0, The variance is equal across all the groups')

**Inference:**

* Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
* This means we do not have sufficient evidence to claim that the variance differs significantly across the weather types.
* Therefore, the assumption of homogeneity of variances can be considered valid.

### Implement ANOVA Test: 
* 2 of the 3 assumptions of the ANOVA test are satisfied. We proceed with the test as instructed.

* For the ANOVA test we set the significance level at 5%. The null and alternative hypotheses are:

**H0 : The mean number of cycles rented is equal across different weather conditions.**

**Ha: There is at least one weather condition with a mean number of cycles rented that significantly differs from the others.**

In [None]:
f_stat,p_value = f_oneway(log_1, log_2, log_3)  # compare the weather samples, not the seasonal ones
print("test stat :",f_stat)
print("p value :",p_value)
alpha = 0.05
if p_value< alpha:
 print("Reject Ho: There is at least one weather condition with a mean number of cycles rented that significantly differs from the others ")
else:
 print("Fail to Reject Ho: The mean number of cycles rented is equal across different weather conditions")

**Final Conclusion:**

* Since the p-value obtained from our test is less than the predetermined alpha level of 0.05, we have sufficient evidence to reject the null hypothesis for this test.
* This means that at least one weather condition has a mean number of cycles rented that differs significantly from the others.
* **Additionally, this suggests that weather conditions do have a notable effect on the number of cycles rented.**

> ## Is weather type dependent on the season?

* To answer this question, we perform a Chi-square test of independence.

**A Chi-Square Test of Independence is used to determine whether or not there is a significant association between two categorical variables.**

* Pearson's Chi-Square test is a statistical hypothesis test for independence between categorical variables.

* The Chi-Square test is non-parametric, i.e. distribution-free: the data need not follow a Gaussian distribution.
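
To show what the test computes, the sketch below derives the expected frequencies from the row and column totals of a small made-up contingency table (not the weather/season table) and compares the resulting statistic with `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x3 table of observed counts, for illustration only.
observed = np.array([
    [30, 20, 10],
    [20, 30, 40],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Under independence, expected count = row total * column total / grand total.
expected_manual = row_totals * col_totals / observed.sum()
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("manual chi2 :", chi2_manual)
print("scipy chi2  :", chi2_stat)
print("dof         :", dof)
```

The degrees of freedom equal (rows - 1) * (columns - 1). SciPy applies the Yates continuity correction by default only when that value is 1, so the two statistics agree for this table.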

* ### Assumptions of Chi-Square Test
1. Variables are categorical
2. Observations are independent
3. Each cell is mutually exclusive
4. The expected value should be 5 or greater in at least 80% of cells, and no cell should have an expected value below 1.

### Conclusion on Assumptions:
* Both variables are categorical: In this dataset, the season column has already been converted to categorical data and the weather column is nominal. Hence this condition is satisfied.

* All observations are independent:
We assume that the sample provided by Yulu was obtained by random sampling, in which case this condition is satisfied.

* Cells in the contingency table are mutually exclusive:
Each record carries exactly one weather type and one season, so it is counted in exactly one cell.

* Expected value of cells should be 5 or greater in at least 80% of cells and none less than 1:
We check this condition after the Pearson chi-square test has been completed.

In [None]:
chi_data = pd.crosstab(data.weather, data.season)
chi_data

**Inference:**
* We observe that only one row in the dataset has weather type 4, which is too little data to determine whether it is associated with the season.
* To avoid bias and skewed results, we exclude this weather type from the analysis.

In [None]:
chi_data.index

In [None]:
df_removed_weather = chi_data[~(chi_data.index == 4)]
df_removed_weather

### Implementing Chi-Square Test

Let's formulate the null and alternative hypotheses to check whether weather is dependent on season.

**H0: Weather is not dependent on the season**

**Ha: Weather is dependent on the season, meaning they are associated or related**

We set the significance level (alpha) at 0.05.

In [None]:
chi2_contingency(df_removed_weather)

In [None]:
x_stat,p_value,dof,expected_freq = chi2_contingency(df_removed_weather)
print("X stat :",x_stat)
print("DOF :",dof)
print('Expected Frequency:',expected_freq)
print("p value :",p_value)
alpha = 0.05
if p_value< alpha:
 print("Reject Ho: Weather is dependent on the season")
else:
 print("Fail to Reject Ho: Weather is not dependent on the season")

**Conclusion:**

* Since the p-value obtained from our test is less than the predetermined alpha level of 0.05, we have sufficient evidence to reject the null hypothesis for this test.
* We therefore conclude that weather and season are dependent.

In [None]:
expected_freq.min()

In [None]:
# Checking the chi2_contingency assumption: percentage of cells with an expected value below 5
# (expected_freq.size counts every cell; len() would only count the rows)
(expected_freq < 5).sum() / expected_freq.size * 100

**Inference:**

* All cells have expected values greater than 5, so the expected-value assumption of the chi-square test is satisfied.
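
The expected-value rule stated in the assumptions (5 or greater in at least 80% of cells, none below 1) can be wrapped in a small helper. This is a sketch only: `expected_counts_ok` is a hypothetical function name, and the table is made up.

```python
import numpy as np

def expected_counts_ok(expected, min_count=5, min_share=0.8):
    """Return True if at least `min_share` of the cells have an expected
    value >= `min_count` and no cell has an expected value below 1."""
    expected = np.asarray(expected, dtype=float)
    share_ok = (expected >= min_count).mean()  # fraction of cells meeting the threshold
    return bool(share_ok >= min_share and expected.min() >= 1)

# Made-up expected frequencies: 5 of 6 cells are >= 5 and the smallest is 4.
table = np.array([[10.0, 4.0, 9.0],
                  [8.0, 20.0, 7.0]])
print(expected_counts_ok(table))
```

The helper accepts the expected-frequency array returned by `chi2_contingency` as well as any nested list of numbers.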

# **Insights:**
* Working days account for 68.1% of cycles rented, and non-working days for 31.9%.

* Despite this split, our t-test does not provide sufficient evidence that working days have a significant effect on the number of cycles rented. In other words, we found no statistically significant difference in the number of cycles rented between working days and non-working days.

* The fall, summer and winter seasons each account for approximately 25.1% of cycles rented.
* Spring has the lowest share of the four seasons, at 24.7% of cycles rented.

* The ANOVA test indicates a statistically significant difference in the mean number of cycles rented between seasons, which suggests that the season influences the number of bikes rented.

* Weather condition 1 has the highest rental share, with approximately 66% of cycles rented.
* Weather condition 2 accounts for around 26.0% of cycles rented.
* Weather condition 3 accounts for approximately 7.9% of cycles rented.
* Weather condition 4 has a negligible share, rounding to 0.0% (a single record).


* The ANOVA test indicates a significant difference in the mean number of cycles rented between weather conditions, which suggests that weather type affects the number of cycles rented.
* The chi-square test results reveal a statistically significant association between weather type and the season.

* The highest average count of rental bikes is observed around 5 PM, with a further peak around 7 AM. This indicates distinct peak hours during the day when cycling is most popular.
* Conversely, the lowest average count of rental bikes occurs at 4 AM, with 3 AM and 5 AM also showing low counts. These hours represent the early morning period with the least demand for cycling. Notably, there is an increasing trend in cycle rentals between 5 AM and 8 AM, suggesting a surge in demand during the early morning hours as people start their day.
* Additionally, there is a decreasing trend in cycle rentals from 5 PM to 11 PM, indicating a gradual decline in demand as the day progresses into the evening and nighttime.
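
Hourly averages like these can be obtained by grouping on the hour of the timestamp. The sketch below assumes a frame with `datetime` and `count` columns, as in the column profile. The six rows are made-up values, and `demo`, `hourly_mean`, `peak_hour` and `quiet_hour` are hypothetical names.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the dataset.
demo = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2011-01-01 04:00', '2011-01-01 08:00', '2011-01-01 17:00',
        '2011-01-02 04:00', '2011-01-02 08:00', '2011-01-02 17:00',
    ]),
    'count': [2, 120, 300, 4, 140, 340],
})

# Average rentals for each hour of the day.
hourly_mean = demo.groupby(demo['datetime'].dt.hour)['count'].mean()

peak_hour = hourly_mean.idxmax()   # hour with the highest average
quiet_hour = hourly_mean.idxmin()  # hour with the lowest average
print(hourly_mean)
print("peak hour:", peak_hour, "| quietest hour:", quiet_hour)
```

Replacing `.dt.hour` with `.dt.month` gives the corresponding monthly averages.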

* The highest average hourly count of rental bikes occurs in June, July, and August, reflecting the peak demand during summer season.
* Conversely, the lowest average hourly count of rental bikes is found in January, February, and March, which are the winter months with reduced cycling activity.
* Notably, there is an increasing trend in average bike rentals from February to June, corresponding to the shift from winter to spring and summer. 
* Conversely, a decreasing trend in average bike rentals is observed from October to December due to the onset of winter.


# **RECOMMENDATIONS:**

- In the summer, fall and winter seasons the company should keep more bikes in stock, because demand in these seasons is higher than in spring (although the gap is small).
- In weather type 1 (Clear, Few clouds, partly cloudy) most of the bikes are rented. Demand is highest in this weather, so stock should be increased to ensure that bikes are available for all customers.
- When the temperature is below 10 °C or on very cold days, the company can keep fewer bikes in stock.
- People prefer to cycle when the feels-like temperature (atemp) is between 20 and 35 °C.
- On very low humidity days, the company can keep fewer bikes in stock.

- When the windspeed is greater than 35 or during thunderstorms, the company can keep fewer bikes in stock, as they are least likely to be rented.

- The lowest average count of rental bikes occurs from 1 AM to 4 AM, so this window can be used for cycle repair, maintenance and charging to increase operational efficiency.
- After maintenance and charging, strategically deploy bikes to high-demand areas in preparation for the morning rush.
- Ensure that bikes are available at key locations, such as transportation hubs, offices, and residential areas.

- During seasons with adverse weather conditions, such as rain or snow, Yulu can provide weather-ready bikes equipped with features like fenders and all-weather or anti-skid tyres.
- This ensures that riders are comfortable and safe during inclement weather.


## Q. What is the duration or time period for which the data is given?

In [None]:
data.datetime.min(), data.datetime.max()

In [None]:
duration = data.datetime.max() -  data.datetime.min()
duration