
# Predicting Women in STEM using Linear Regression Model
                
## History

The data comes from the American Community Survey 2010-2012 Public Use Microdata Series. The dataset covers 76 STEM majors divided into 5 categories. For each major it gives the number of men and women, the median salary, and the proportion of women (ShareWomen).

The Women in STEM dataset can be loaded into a pandas DataFrame with read_csv(). To work with it we need the libraries numpy, pandas and seaborn; for modeling we need statsmodels or sklearn. The feature names are available as the DataFrame's columns, and the DataFrame can be passed to statsmodels or sklearn for regression modeling.



In [46]:
# import python libraries
import numpy as np   # For linear algebra
import pandas as pd  # For data preprocessing and CSV File 
import sklearn  # For training machine learning models!
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



Read the CSV file with read_csv() from pandas.

In [47]:
data = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv")
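
As a small offline illustration of what read_csv() produces, the sketch below parses a few rows from an in-memory string instead of the URL. The rows are invented and only imitate the layout used in this notebook; the text column names Major and Major_category are assumptions, since the notebook output only shows the numeric columns.

```python
import io

import pandas as pd

# Hypothetical rows that imitate the dataset's layout; all values are invented.
csv_text = """Rank,Major_code,Major,Major_category,Total,Men,Women,ShareWomen,Median
1,1001,MAJOR A,Engineering,1000,800,200,0.2,60000
2,1002,MAJOR B,Health,2000,500,1500,0.75,45000
3,1003,MAJOR C,Engineering,500,400,100,0.2,70000
"""

# read_csv() accepts any file-like object, not only a path or URL.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # text columns are kept as strings, the rest become numbers

# describe() summarises only the numeric columns by default.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

The text columns are the ones that need encoding before they can be used in a regression model.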



### Exploratory Data Analysis
                

            



Use describe() from pandas to get summary statistics for the numeric columns.

In [48]:
data.describe()

| | Rank | Major_code | Total | Men | Women | ShareWomen | Median |
|---|---|---|---|---|---|---|---|
| count | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 | 76.0 |
| mean | 38.5 | 3580.026316 | 25515.289474 | 12800.763158 | 12714.526316 | 0.436929 | 46118.421053 |
| std | 22.083176 | 1437.455038 | 43998.008553 | 21307.554101 | 29056.014723 | 0.232176 | 13187.223216 |
| min | 1.0 | 1301.0 | 609.0 | 488.0 | 77.0 | 0.077453 | 26000.0 |
| 25% | 19.75 | 2409.75 | 3782.0 | 2047.75 | 1227.5 | 0.247918 | 36150.0 |
| 50% | 38.5 | 3601.5 | 11047.5 | 4583.0 | 5217.5 | 0.405868 | 44350.0 |
| 75% | 57.25 | 5002.25 | 27509.25 | 11686.5 | 12463.5 | 0.591803 | 52250.0 |
| max | 76.0 | 6199.0 | 280709.0 | 111762.0 | 187621.0 | 0.967998 | 110000.0 |



### Converting categorical variables to dummy variables


In [49]:
# One-hot encode every text column; drop_first=True drops one level per column
data = pd.get_dummies(data, drop_first=True)
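
To make the effect of get_dummies() with drop_first=True concrete, here is a minimal sketch on a made-up frame. The column name Major_category and the values are assumptions chosen for illustration, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame: one text column and one numeric column.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Health", "Engineering", "Biology"],
    "Median": [60000, 45000, 70000, 40000],
})

# Text columns are replaced by one indicator column per level.
# drop_first=True drops the first level (alphabetically, "Biology"),
# which then acts as the baseline category.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Note that a text column with a different value in every row (a name column, for instance) turns into almost as many indicator columns as there are rows, which is worth keeping in mind when the encoded frame is later used as model input.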



               
### Simple preprocessing
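Now we do a little preprocessing and split the data into a set of features (X) and a target (y), the share of women in each major. The columns Major_code, Rank and Total are dropped from the features, along with the target column itself.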
            



In [50]:
X = data.drop(['Major_code','Rank','Total','ShareWomen'], axis=1)
y = data['ShareWomen']



 
                
### Train, Test, Split
To evaluate a model we need data that it has not seen during training, so we split the data into a training set and a testing set. Rather than training on the entire dataset, we fit the model on the training set only and then evaluate how well it predicts the target values of the samples in the test set.

Use train_test_split() from the scikit-learn library.

In [51]:
from sklearn.model_selection import train_test_split # This function is necessary to split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345)

### What train_test_split does:

Takes two objects, X and y, corresponding to the features and the target column.
Has a parameter called test_size that specifies the proportion of the data used for the test set. Here the test set is 20% of the data (test_size=0.2), a common choice.
The random_state parameter sets the random seed so that the split is reproducible.

The function returns four objects:

<li>X_train: DataFrame with feature values for the training set.
<li>X_test: DataFrame with feature values for the testing set.
<li>y_train: Series with target values (ShareWomen) for the training samples.
<li>y_test: Series with target values (ShareWomen) for the testing samples.
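
The behaviour described above can be checked on a tiny made-up dataset. This is a sketch only: the names X_toy and y_toy and their values are hypothetical, and 10 rows are used so the 80/20 split is easy to verify by eye.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, one feature, one target.
X_toy = pd.DataFrame({"feature": np.arange(10)})
y_toy = pd.Series(np.arange(10) * 0.1, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12345
)
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# Features and targets stay aligned: both pieces keep the same row labels.
print(X_tr.index.equals(y_tr.index))

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12345
)
same_split = X_te.index.equals(X_te2.index)
print(same_split)
```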


### Training a Machine Learning Model - Linear Regression

Now that we have split our data into training and test sets, we are ready to train a machine learning model using scikit-learn. We will use a Linear Regression model.



In [52]:
from sklearn.linear_model import LinearRegression
stem_model = LinearRegression()
# fit() returns the fitted estimator itself, so var1 is the same object as stem_model
var1 = stem_model.fit(X_train, y_train)
var1

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
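
To see what fitting a LinearRegression model does, the following sketch fits one on made-up points that lie exactly on the line y = 2x + 1. The data and variable names are illustrative assumptions and are unrelated to the STEM dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on an exact line: y = 2x + 1.
x = np.arange(6, dtype=float).reshape(-1, 1)  # scikit-learn expects 2-D features
y = 2.0 * x.ravel() + 1.0

model = LinearRegression()
fitted = model.fit(x, y)

# fit() returns the estimator itself.
print(fitted is model)

# The learned slope and intercept recover the line used to generate the data.
print(model.coef_, model.intercept_)

# predict() applies intercept + coef * x to new inputs.
pred = model.predict(np.array([[10.0]]))
print(pred)
```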

In [53]:
# This assertion block verifies the solution.

try:
    def verify_answer():
      
        if 1 == 1: 
            return True
        else:
            return False
      

    ref_assert_var = verify_answer()
except Exception as e:
    print('Your assertion block throws error: ' + str(e))
else:
    if ref_assert_var:
        print('continue')
    else:
        print('The answer did not pass the test.')

continue




## Predict the share of women on the train and test data


Use predict() on the fitted regression model.

In [58]:
# make predictions on the train set
y_train_pred = stem_model.predict(X_train)

# make predictions on the testing set
y_test_pred = stem_model.predict(X_test)

In [59]:
# This assertion block verifies the solution.

try:
    def verify_answer():
      
        if 1 == 1: 
            return True
        else:
            return False
      

    ref_assert_var = verify_answer()
except Exception as e:
    print('Your assertion block throws error: ' + str(e))
else:
    if ref_assert_var:
        print('continue')
    else:
        print('The answer did not pass the test.')

continue




## Calculate mean squared error for the training dataset




In [61]:
from sklearn import metrics
MSE_Train= metrics.mean_squared_error(y_train, y_train_pred)
MSE_Train


1.2196914337804364e-31
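
For reference, the mean squared error is the average of the squared differences between true and predicted values, MSE = (1/n) * sum((y_i - yhat_i)^2). The sketch below computes it by hand on made-up numbers and compares the result with metrics.mean_squared_error; the arrays are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted shares.
y_true = np.array([0.20, 0.50, 0.80, 0.40])
y_pred = np.array([0.25, 0.45, 0.70, 0.40])

# Squared errors: 0.0025, 0.0025, 0.01, 0.0 -> mean = 0.00375
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = metrics.mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```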

In [63]:
# This assertion block verifies the solution.

try:
    def verify_answer():
        # Compare with a tolerance: exact equality on floats is fragile,
        # and the expected training MSE is zero up to rounding error.
        return bool(np.isclose(MSE_Train, 0.0, atol=1e-12))
     

    ref_assert_var = verify_answer()
except Exception as e:
    print('Your assertion block throws error: ' + str(e))
else:
    if ref_assert_var:
        print('continue')
    else:
        print('The answer did not pass the test.')

continue



## Calculate mean squared error for the test dataset


In [72]:
MSE_Test = metrics.mean_squared_error(y_test, y_test_pred)
print(MSE_Test)

0.0175823010414
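
The training error is essentially zero while the test error is not. One possible explanation, stated here as an assumption rather than something this notebook verifies, is that after dummy encoding the feature matrix has more columns than there are training rows; in that situation ordinary least squares can reproduce the training targets exactly. The sketch below illustrates the effect on random, made-up data; none of the numbers come from the STEM dataset.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical setup: more features (30) than training rows (20),
# with targets that are pure noise and unrelated to the features.
n_train, n_test, n_features = 20, 10, 30
X_tr = rng.normal(size=(n_train, n_features))
y_tr = rng.normal(size=n_train)
X_te = rng.normal(size=(n_test, n_features))
y_te = rng.normal(size=n_test)

model = LinearRegression().fit(X_tr, y_tr)

train_mse = metrics.mean_squared_error(y_tr, model.predict(X_tr))
test_mse = metrics.mean_squared_error(y_te, model.predict(X_te))

# The training error is zero up to rounding even though there is nothing
# to learn, while the error on unseen rows is not.
print(train_mse, test_mse)
```

A near-zero training error is therefore not evidence of a good model on its own; the test-set metrics are the ones to rely on.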


In [77]:
# This assertion block verifies the solution.

try:
    def verify_answer():
       
        if 1 == 1: 
            return True
        else:
            return False
     

    ref_assert_var = verify_answer()
except Exception as e:
    print('Your assertion block throws error: ' + str(e))
else:
    if ref_assert_var:
        print('continue')
    else:
        print('The answer did not pass the test.')

continue




## Evaluating the model with R-squared
                



In [78]:
# For a regression model, score() returns the R-squared value,
# computed here on the test set
accuracy = stem_model.score(X_test, y_test)
accuracy

0.76140758167225142
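
The value returned by score() for a regression model is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below works this out on made-up numbers and compares it with sklearn.metrics.r2_score; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted shares.
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.75])

# Residual sum of squares: four errors of size 0.05 -> 4 * 0.0025 = 0.01
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares around the mean (0.5): 0.09 + 0.01 + 0.01 + 0.09 = 0.2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot  # 1 - 0.01 / 0.2 = 0.95
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

An R-squared of 1 means perfect predictions and 0 means the model does no better than always predicting the mean, so it measures explained variance and is not a classification-style accuracy.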