<h2 style="color:green" align="center">Data Analysis and Visualization with scikit-learn and Plotly</h2>



**Double-click here and paste the path of the CSV file on your computer:**

/Users/yourname/documents/master/programming/DataAnalysisVisualization/Dataset_AnaVis.csv

**First:** make sure that all necessary libraries are installed so that you can import them.
 <ol>
  <li>Start Anaconda</li>
  <li>Go to Environment</li>
  <li>Select 'All' from dropdown top left</li>
  <li>Search for pandas, numpy, scikit-learn, plotly (pandas & numpy should already be there)</li>
  <li>If a library is not ticked, tick it and click 'Apply'</li>
</ol> 
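
You can also check the installation from inside the notebook by asking Python whether each package can be found. This is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for illustration. Note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the import names from `names` that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]

# Import names used in this notebook (scikit-learn is imported as 'sklearn')
required = ['pandas', 'numpy', 'sklearn', 'plotly', 'matplotlib']
print('Missing:', missing_packages(required) or 'none')
```

Any name listed as missing still needs to be installed through the Anaconda steps above.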


<h4 style="color:green" >Ready to start importing the libs and our data</h4>


<img src="Explanation1.png"><img src="Explanation3.png">

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
# from plotly.offline import iplot (for offline graphics, not needed today)


# Import the CSV file by asking for its input path.
# The CSV file is called Dataset_AnaVis.csv and should be saved in the same folder as this notebook.
# When asked, enter 'Dataset_AnaVis.csv' or the full path of your file.
pth = input('Please enter the path of the CSV file: ')

# specify structure of CSV  and import it as a pandas dataframe
df = pd.read_csv(pth, delimiter=';', decimal=',',engine='python')

df.head() # show the first 5 rows
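
To see what `delimiter=';'` and `decimal=','` do, here is a small sketch on an in-memory sample. The two columns and their values are made up; only the file style (semicolon-separated, decimal comma) mirrors the call above.

```python
import io
import pandas as pd

# Made-up sample in the same style: ';' separates columns, ',' marks decimals
sample = "Age;Attitude\n23;4,5\n31;3,25\n"

demo = pd.read_csv(io.StringIO(sample), delimiter=';', decimal=',')
print(demo)
print(demo.dtypes)  # Attitude is parsed as a float column

# Without decimal=',' the same column is read as text instead of numbers
demo_raw = pd.read_csv(io.StringIO(sample), delimiter=';')
print(demo_raw.dtypes)
```

If the numeric columns of your own file show up as text, a mismatched `decimal` setting is a likely cause.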



<h3 style='color:green'>1. Short introduction: meaning of variables</h3>
A brief note on the data: it comes from a study on the adoption of electric vehicles (EV) as an innovation. One group was given additional information about EVs (Group = 1), whereas the other group was given no information (Group = 2).
<br></br>Note: This is the raw data; no outliers were excluded.

The empirical data are based on questionnaires on the **theory of planned behavior (Ajzen, 1985)** and the **diffusion of innovation theory (Rogers & Shoemaker, 1971)**, both shown below. In addition, a behavioral paradigm on vehicle choice was carried out.
<img src="TPB_modified.png" style='height:198px;width:300px'>
<div style='font-size:10px'>Source: https://www.researchgate.net/figure/A-modified-version-of-the-theory-of-planned-behaviour_fig2_241054757</div>
    <br></br>
<img src="IDT_Rogers.png" style='height:198px;width:300px'>
<div style='font-size:10px'>Source: https://www.researchgate.net/figure/Innovation-Diffusion-Theory-IDT-Rogers-1983_fig1_315416994</div>


We need to drop rows with missing values (NA / NaN), because scikit-learn's `LinearRegression`, which we use later, cannot be fitted on data that contain NaN.

In [None]:
df.dropna(inplace=True) # drop rows with missing values, modifying df in place
df.head()
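
Before dropping anything, it can be useful to count how many values are missing and how many rows will be lost. The sketch below uses a small made-up dataframe, not the study data.

```python
import numpy as np
import pandas as pd

# Made-up data with two incomplete rows
toy = pd.DataFrame({'Age': [23, np.nan, 31, 45],
                    'Attitude': [4.5, 3.0, np.nan, 2.5]})

print(toy.isna().sum())        # missing values per column
print(len(toy), 'rows before')

clean = toy.dropna()           # returns a new frame; toy itself is unchanged
print(len(clean), 'rows after')
```

Unlike `inplace=True`, this variant keeps the original frame, so the two row counts can be compared afterwards.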

Let's look at the data types of the imported variables

In [None]:
dataTypeSeries = df.dtypes # series holding the data type of each column
print(dataTypeSeries)

If a column does not have the expected data type, cast it so that numeric variables are treated as numeric and strings as object (the pandas dtype for text)

In [None]:
# cast to the correct data type (for example Sex as integer); the values are small, so Int32 / Float32 are sufficient
df['Sex'] = df['Sex'].astype('Int32')
df['Group'] = df['Group'].astype('Int32')
df['Age'] = df['Age'].astype('Int32')
df['RelativeAdvantage'] = df['RelativeAdvantage'].astype('Float32')
df['Complexity'] = df['Complexity'].astype('Float32')
df['Compatibility'] = df['Compatibility'].astype('Float32')
df['Attitude'] = df['Attitude'].astype('Float32')
df['SubjectiveNormPeers'] = df['SubjectiveNormPeers'].astype('Float32')
df['SubjectiveNormSociety'] = df['SubjectiveNormSociety'].astype('Float32')
df['SubjectiveNormMedia'] = df['SubjectiveNormMedia'].astype('Float32')
df['SubjectiveNormTotal'] = df['SubjectiveNormTotal'].astype('Float32')
df['PerceivedMoralNorm'] = df['PerceivedMoralNorm'].astype('Float32')
df['EnvironmentalAttitude'] = df['EnvironmentalAttitude'].astype('Float32')
df['SelfEfficacy'] = df['SelfEfficacy'].astype('Float32')
df['PurchaseIntention'] = df['PurchaseIntention'].astype('Float32')
df['EVChoice'] = df['EVChoice'].astype('Int32')

dataTypeSeries = df.dtypes #check result
print(dataTypeSeries)
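
The same kind of cast can be written as a single `astype` call with a column-to-dtype mapping. The sketch below uses a made-up two-column dataframe. The capitalized names `'Int32'` and `'Float32'` are pandas' nullable extension dtypes, while the lowercase `'int32'` and `'float32'` are plain NumPy dtypes.

```python
import pandas as pd

# Made-up values; only the column names are borrowed from the dataset
toy = pd.DataFrame({'Sex': [1.0, 2.0], 'Attitude': [4.5, 3.25]})

# One astype call with a column -> dtype mapping
toy = toy.astype({'Sex': 'Int32', 'Attitude': 'Float32'})
print(toy.dtypes)
```

Columns that are not named in the mapping keep their current dtype.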


### 2. Let's explore the data a little bit:




In [None]:
# Histogram of the age distribution (number of participants per age bin)
figAge = px.histogram(df, x='Age')
figAge.show() # show the figure directly


# Display environmental attitude across the age distribution
figAgeEnv = px.scatter(df, x='Age', y='EnvironmentalAttitude')
figAgeEnv.show()
# also possible to show the plot in an html file
# figAgeEnv.write_html('figAgeEnv.html', auto_open = True)

# Environmental attitude in relation to the intention to purchase an EV
figPurchase = px.scatter(df, x='EnvironmentalAttitude', y='PurchaseIntention')
figPurchase.show()




In [None]:
# for numerical data it is possible to add a trendline fitted with ordinary least squares (OLS)
figPurchase = px.scatter(df, x='EnvironmentalAttitude', y='PurchaseIntention', trendline='ols')
figPurchase.show()
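
An `'ols'` trendline is a straight line fitted by ordinary least squares, and Plotly Express needs the statsmodels package to draw it. To illustrate what such a fit computes, here is a minimal sketch that uses `np.polyfit` instead of Plotly's own machinery, on made-up numbers.

```python
import numpy as np

# Made-up scores, not taken from the dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Degree-1 least-squares fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f'slope={slope:.3f}, intercept={intercept:.3f}')

# Points on the fitted line, i.e. what a trendline would display
y_line = slope * x + intercept
print(y_line)
```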



<h3> 3. Let's do some theoretically based plotting</h3>

In [None]:
figRA = px.scatter(df, x='RelativeAdvantage', y='Attitude', trendline='ols')
figRA.show()

figComp = px.scatter(df, x='Complexity', y='Attitude', trendline='ols')
figComp.show()

#figCom = px.scatter(df, x='Compatibility', y='Attitude', trendline='ols', marginal_y='violin', marginal_x='violin')
#figCom.show()

# show different plots with the same y-variable in one diagram: first create the single traces, then combine them in one figure
# trace1 = go.Scatter(x=df.RelativeAdvantage, y=df.Attitude, mode='markers', name='Relative Advantage')
# trace2 = go.Scatter(x=df.Compatibility, y=df.Attitude, mode='markers', name='Compatibility')
# trace3 = go.Scatter(x=df.Complexity, y=df.Attitude, mode='markers', name='Complexity')
# figStacked = go.Figure([trace1, trace2, trace3])
# figStacked.show()

In [None]:
# Compatibility predicting Attitude, one panel per Group (see section 1 for the group coding), colored by RelativeAdvantage
figRA2 = px.scatter(df, x='Compatibility', y='Attitude', facet_col='Group', color='RelativeAdvantage', trendline='ols')
figRA2.show()

# perceived behavioral control predicting purchase intention, as in the theory of planned behavior
# figAttiChoice = px.scatter(df, x='PerceivedBehavioralControl', y="PurchaseIntention", color='Group', trendline='ols')
# figAttiChoice.show()

In [None]:
# behavioral component: electric vehicle choice depending on attitude / environmental attitude / subjective norm media
figAttEV = px.histogram(df, x='Attitude', y="EVChoice", histfunc='sum', color='Group')
figAttEV.show()

figEnvEV = px.histogram(df, x='EnvironmentalAttitude', y="EVChoice", histfunc='sum', color='Group')
figEnvEV.show()

figNorm = px.histogram(df, x='SubjectiveNormMedia', y="EVChoice", histfunc='sum', color='Group')
figNorm.show()
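
With `histfunc='sum'`, each bar shows the sum of the `y` values that fall into a bin of `x`, split by the `color` column. A pandas `groupby` reproduces the idea on a made-up dataframe. The sketch assumes that EVChoice is coded 0/1, which is not stated above, and it groups by exact values, whereas the histogram may merge neighboring values into one bin.

```python
import pandas as pd

# Made-up data; EVChoice is ASSUMED to be 0/1 here
toy = pd.DataFrame({'Attitude': [1, 1, 2, 2, 2, 3],
                    'Group':    [1, 2, 1, 1, 2, 2],
                    'EVChoice': [0, 1, 1, 1, 0, 1]})

# Sum of EVChoice for every combination of Attitude value and Group
sums = toy.groupby(['Attitude', 'Group'])['EVChoice'].sum()
print(sums)
```

Under the 0/1 assumption, each sum is simply the number of participants in that cell who chose an EV.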


<h3> 4. Just some more data exploration</h3>


In [None]:
fig3 = px.scatter(df, x="EnvironmentalAttitude", y="EVChoice")
fig3.show()

fig4 = px.scatter(df, x="EnvironmentalAttitude", y="EVChoice", color="Sex")
fig4.show()

fig5 = px.scatter(df, x="EnvironmentalAttitude", y="EVChoice", color="Sex", marginal_y="violin",marginal_x="box")
fig5.show()


<h3> 4. Data analysis with scikit-learn</h3>
    
<text>First we will explore linear regression:</text>
We need to convert our pandas dataframe into a multidimensional NumPy array

In [None]:
npdf = df.to_numpy(copy=True)
x = npdf[:,13].reshape(-1, 1)
y = npdf[:,17].reshape(-1, 1)
#print(x)
#print('______________')
#print(y)

# apply linear regression to the data: column 13 (perceived behavioral control) is the predictor, column 17 (purchase intention) the criterion
lr = LinearRegression()
lr.fit(x,y)
y_pred = lr.predict(x)


# plot with matplotlib (works, but looks plainer than the Plotly figures)
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.show()
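
Selecting columns by position works only as long as the column order of the CSV never changes. A variant that selects by name is sketched below on made-up data. The column names `Predictor` and `Outcome` are placeholders and do not appear in the dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data; the column names are placeholders for illustration
toy = pd.DataFrame({'Predictor': [1.0, 2.0, 3.0, 4.0],
                    'Outcome':   [3.0, 5.0, 7.0, 9.0]})

X = toy[['Predictor']]   # double brackets keep a 2-D shape (n_samples, 1)
y = toy['Outcome']       # 1-D target

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # slope and intercept
print(model.score(X, y))                  # R^2 on the training data
```

Because the double brackets already return a two-dimensional object, no `reshape(-1, 1)` is needed in this form.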



<h3> 5. Plot the linear regression with plotly</h3>

In [None]:
# we want to plot the data with plotly

# we need to flatten the (n, 1) arrays to one dimension in order to feed them to plotly
x_flat = x[:, 0].astype(float)
y_pred_flat = y_pred[:, 0].astype(float)

# create a dataframe from the flattened output columns
d = {'X': x_flat, 'Y Prediction': y_pred_flat}
df2 = pd.DataFrame(data=d)
df2

# plot it with plotly
fig = px.scatter(df2, x='X', y='Y Prediction',trendline='ols')
fig.show()


<h3> 6. Multiple regression predicting attitude with innovation properties</h3>

In [None]:
# initialize regression
reg = linear_model.LinearRegression()

# fill in parameters for the regression (predictors and criterion)
reg.fit(df[['RelativeAdvantage','Complexity','Compatibility']],df.Attitude)

In [None]:
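# show the coefficients (order: RelativeAdvantage, Complexity, Compatibility) and the intercept
print(reg.coef_)
print(reg.intercept_)

# predict the attitude for two example respondents; a dataframe with the same column names as in fit avoids a feature-name warning
new_obs = pd.DataFrame([[5.6, 2.1, 4.5], [5.8, 2.14, 4.7]],
                       columns=['RelativeAdvantage', 'Complexity', 'Compatibility'])
print(reg.predict(new_obs))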
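
To see how the coefficients relate to a prediction, the following sketch builds made-up predictor values and an Attitude score that is constructed without noise from known weights. The weights 0.5, -0.25 and 0.75 and the intercept 1 are arbitrary choices for illustration, not results from the study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ['RelativeAdvantage', 'Complexity', 'Compatibility']

# Made-up predictor values; Attitude is constructed without noise as
# 1 + 0.5*RelativeAdvantage - 0.25*Complexity + 0.75*Compatibility
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(1, 7, size=(20, 3)), columns=cols)
y = 1 + 0.5 * X['RelativeAdvantage'] - 0.25 * X['Complexity'] + 0.75 * X['Compatibility']

model = LinearRegression().fit(X, y)
print(dict(zip(cols, model.coef_.round(3))))   # one coefficient per predictor
print(round(model.intercept_, 3))

# A prediction is the intercept plus the weighted sum of the predictor values
new = pd.DataFrame([[5.6, 2.1, 4.5]], columns=cols)
print(model.predict(new))
```

On noise-free data the fitted coefficients match the weights used in the construction. On real questionnaire data they are estimates and should be read together with a measure of fit such as `model.score`.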