# EV dataset EDA and Regression


![Electric vehicles (Tesla)](https://cdn.images.express.co.uk/img/dynamic/galleries/x701/312168.jpg)

**Import of Packages**

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import statsmodels.api as sm

**Import of the CSV file**

In [None]:
df= pd.read_csv('../input/evs-one-electric-vehicle-dataset/ElectricCarData_Clean.csv')

**Top five rows of the dataset**

In [None]:
df.head()

**Finding out the number of null values**

In [None]:
df.isnull().sum()

There are no null values in the dataset

**Descriptive Statistics of the dataset**

In [None]:
df.describe()

**Information on the type of data in each column**

In [None]:
df.info()

In [None]:
a=np.arange(1,104)

**Pairplot of all the columns based on Rapid Charger presence**

In [None]:
sb.pairplot(df,hue='RapidCharge')

**Heatmap to show the correlation of the data**

In [None]:
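fig= plt.figure(figsize=(15,8))
# numeric_only=True keeps only numeric columns; on pandas >= 2.0 text columns would otherwise raise an error
sb.heatmap(df.corr(numeric_only=True),linewidths=1,linecolor='white',annot=True)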

**Frequency of the Brands in the dataset**

In [None]:
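ax= plt.figure(figsize=(20,5))
sb.countplot(x='Brand',data=df)  # counts the rows (car models) per brand
plt.grid(axis='y')
plt.title('Brands in the dataset')
plt.xlabel('Brand')
plt.ylabel('Frequency')
plt.xticks(rotation=45)

Each bar counts the models a brand has in the dataset. Note that passing the row numbers `a` as `y` to `sb.barplot` would plot each brand's mean row position instead; a ranking read from such a plot (Byton, Fiat and Smart highest, Polestar lowest) reflects row order, not frequency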
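
Brand frequency can also be checked numerically, independent of any plot, with `value_counts()`. A minimal sketch on a made-up toy frame (the brand labels and counts are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy data: brand labels and counts are made up for illustration only
toy = pd.DataFrame({'Brand': ['A', 'B', 'A', 'C', 'A', 'B']})

# Number of rows (car models) per brand, sorted from most to least frequent
brand_counts = toy['Brand'].value_counts()
print(brand_counts)
```

On the real data the same call would be `df['Brand'].value_counts()`.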

**Top speeds achieved by the cars of a brand**

In [None]:
ax= plt.figure(figsize=(20,5))
sb.barplot(x='Brand',y='TopSpeed_KmH',data=df,palette='Paired')
plt.grid(axis='y')
plt.title('Top Speed achieved by a brand')
plt.xlabel('Brand')
plt.ylabel('Top Speed')
plt.xticks(rotation=45)

On average, Porsche, Lucid and Tesla produce the fastest cars and Smart the slowest (each bar shows the mean top speed of a brand's models)
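
`sb.barplot` draws the mean of the y-variable per category by default, so a per-brand chart shows each brand's average rather than its single fastest model. A minimal sketch on made-up data (brand labels and speeds are hypothetical) showing how the two rankings can differ:

```python
import pandas as pd

# Toy data: brands and speeds are made up for illustration only
toy = pd.DataFrame({
    'Brand': ['A', 'A', 'B', 'B'],
    'TopSpeed_KmH': [150, 250, 210, 220],
})

# 'mean' is what sb.barplot plots by default; 'max' is the fastest single model
speed_by_brand = toy.groupby('Brand')['TopSpeed_KmH'].agg(['mean', 'max'])
print(speed_by_brand)

# The brand with the highest average need not own the fastest car
print(speed_by_brand['mean'].idxmax(), speed_by_brand['max'].idxmax())
```

If the fastest model per brand is what matters, the same `groupby` with `max` on `df` would be the place to look.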



**Range a car can achieve**

In [None]:
ax= plt.figure(figsize=(20,5))
sb.barplot(x='Brand',y='Range_Km',data=df,palette='tab10')
plt.grid(axis='y')
plt.title('Average range achieved by a brand')
plt.xlabel('Brand')
plt.ylabel('Range')
plt.xticks(rotation=45)

On average, Lucid, Lightyear and Tesla have the highest range and Smart the lowest


**Car efficiency**

In [None]:
ax= plt.figure(figsize=(20,5))
sb.barplot(x='Brand',y='Efficiency_WhKm',data=df,palette='hls')
plt.grid(axis='y')
plt.title('Efficiency achieved by a brand')
plt.xlabel('Brand')
plt.ylabel('Efficiency')
plt.xticks(rotation=45)

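`Efficiency_WhKm` is energy consumption in Wh per km, so a lower value means a more efficient car. Lightyear, with the lowest bar, is the most efficient; Byton, Jaguar and Audi, with the highest bars, consume the most energy per km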
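
Because `Efficiency_WhKm` is measured in Wh per km, it can help to convert it into distance per kWh, where higher is better. A minimal sketch (the helper `km_per_kwh` and the sample values are hypothetical, not taken from the dataset):

```python
def km_per_kwh(wh_per_km):
    """Convert energy consumption (Wh/km) into distance per kWh (km/kWh)."""
    return 1000.0 / wh_per_km

# Illustrative consumption values, not taken from the dataset
for consumption in (150, 200, 250):
    print(consumption, 'Wh/km ->', round(km_per_kwh(consumption), 2), 'km/kWh')
```

A car that needs fewer Wh per km travels further on each kWh, which is why the lowest bar in the plot marks the most efficient brand.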

**Number of seats in each car**

In [None]:
ax= plt.figure(figsize=(20,5))
sb.barplot(x='Brand',y='Seats',data=df,palette='husl')
plt.grid(axis='y')
plt.title('Seats in a car')
plt.xlabel('Brand')
plt.ylabel('Seats')
plt.xticks(rotation=45)

On average, Mercedes, Tesla and Nissan models have the most seats and Smart the fewest

**Price of cars (in Euro)**

In [None]:
ax= plt.figure(figsize=(20,5))
sb.barplot(x='Brand',y='PriceEuro',data=df,palette='Set2')
plt.title('Price of a Car')
plt.xlabel('Brand')
plt.grid(axis='y')
plt.ylabel('Price in Euro')
plt.xticks(rotation=45)

On average, Lightyear, Porsche and Lucid are the most expensive brands, and SEAT and Smart the least expensive

**Type of Plug used for charging**

In [None]:
df['PlugType'].value_counts().plot.pie(figsize=(8,15),autopct='%.0f%%',explode=(.1,.1,.1,.1))
plt.title('Plug Type')

Type 2 CCS is the most common plug type and Type 1 CHAdeMO the least common

**Cars and their body style**

In [None]:
df['BodyStyle'].value_counts().plot.pie(figsize=(8,15),autopct='%.0f%%',explode=(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))
plt.title('Body Style')

Most cars are either SUVs or hatchbacks

**Segment each car falls under**

In [None]:
df['Segment'].value_counts().plot.pie(figsize=(8,15),autopct='%.0f%%',explode=(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))
plt.title('Segment')

Most cars fall in segment C or B


**Number of Seats**

In [None]:
df['Seats'].value_counts().plot.pie(figsize=(8,15),autopct='%.0f%%',explode=(0.1,0.1,0.1,0.1,0.1))
plt.title('Seats')

The majority of cars have 5 seats

**Putting independent variables as x and dependent variable as y**

In [None]:
x=df[['AccelSec','Range_Km','TopSpeed_KmH','Efficiency_WhKm']]
y=df['PriceEuro']

**Linear regression using the OLS method**

In [None]:
x= sm.add_constant(x)    # adds an intercept column named 'const'
results = sm.OLS(y,x)    # unfitted OLS model; it is fitted in the next cell

**Fitting the model and summarizing**

In [None]:
model=results.fit()
model.summary()

In the summary, only Top Speed and Efficiency come out as significantly related to price

**Importing train test split from Scikit Learn**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3,random_state=365)

**Importing Linear regression**

In [None]:
from sklearn.linear_model import LinearRegression
lr= LinearRegression()

In [None]:
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

**Finding out the R-squared value**

In [None]:
from sklearn.metrics import r2_score
r2=(r2_score(y_test,pred))
print(r2*100)

Around 78% of the variance in the dependent variable (price) is explained by the independent variables on the test set
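
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch that computes it by hand with NumPy on made-up numbers (the values are hypothetical, not model output from this notebook):

```python
import numpy as np

# Made-up true and predicted prices, for illustration only
y_true = np.array([30000.0, 45000.0, 60000.0, 80000.0])
y_hat = np.array([32000.0, 43000.0, 63000.0, 76000.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual * 100, 1))
```

This is the same quantity `r2_score` reports, so it can serve as a sanity check on a small sample.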

**Putting Yes value as 1 and No value as 0 for Logistic Regression**

In [None]:
df['RapidCharge'] = df['RapidCharge'].replace({'No': 0, 'Yes': 1})  # assign back; chained inplace replace may not update df on newer pandas

In [None]:
y1=df['RapidCharge']  # 1-D target, the shape scikit-learn expects for y
x1=df[['PriceEuro']]

In [None]:
from sklearn.model_selection import train_test_split
X1_train, X1_test, y1_train, y1_test = train_test_split(x1, y1, test_size=0.2,random_state=365)

**Importing Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log= LogisticRegression()

In [None]:
log.fit(X1_train, y1_train)
pred1 = log.predict(X1_test)
pred1

**Confusion matrix of the logistic regression**

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y1_test, pred1)
cm
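
In scikit-learn's `confusion_matrix`, rows are the true classes and columns the predicted classes, so a binary result reads `[[TN, FP], [FN, TP]]`. A minimal sketch that builds the same table by hand on made-up labels (hypothetical values, not this notebook's predictions):

```python
import numpy as np

# Made-up labels for illustration: 1 = rapid charge, 0 = no rapid charge
y_true = np.array([1, 1, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 1, 0])

# Build the 2x2 table by hand: rows are true classes, columns are predictions
cm_manual = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm_manual[t, p] += 1

# Unpack in row-major order: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm_manual.ravel()
print(cm_manual)
```

The diagonal holds the correct predictions; the off-diagonal cells show which kind of error the classifier makes.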

**Finding out the accuracy score**

In [None]:
from sklearn.metrics import accuracy_score
score=accuracy_score(y1_test,pred1)
score*100

The model classifies about 95% of the test samples correctly
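
An accuracy figure is easier to judge next to a baseline that always predicts the most frequent class. If `RapidCharge` is imbalanced (an assumption; `df['RapidCharge'].value_counts()` would confirm it), such a baseline can already score high. A minimal sketch on made-up labels (hypothetical values, not the notebook's test set):

```python
import numpy as np

# Made-up test labels for illustration: 1 = rapid charge, 0 = none
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Baseline: always predict the most frequent class
majority_class = np.bincount(y_true).argmax()
baseline_pred = np.full_like(y_true, majority_class)
baseline_accuracy = (baseline_pred == y_true).mean()
print(baseline_accuracy * 100)
```

Comparing the model's accuracy with this baseline shows how much `PriceEuro` actually adds beyond the class frequencies.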

# Thank You