# Question 1: Propensity score matching

In this exercise, you will apply [propensity score matching](http://www.stewartschultz.com/statistics/books/Design%20of%20observational%20studies.pdf), which we discussed in lecture 5 ("Observational studies"), in order to draw conclusions from an observational study.

We will work with a by-now classic dataset from Robert LaLonde's study "[Evaluating the Econometric Evaluations of Training Programs](http://people.hbs.edu/nashraf/LaLonde_1986.pdf)" (1986).
The study investigated the effect of a job training program ("National Supported Work Demonstration") on the real earnings of an individual, a couple of years after completion of the program.
Your task is to determine the effectiveness of the "treatment" represented by the job training program.

## Dataset description

- `treat`: 1 if the subject participated in the job training program, 0 otherwise
- `age`: the subject's age
- `educ`: years of education
- `race`: categorical variable with three possible values: Black, Hispanic, or White
- `married`: 1 if the subject was married at the time of the training program, 0 otherwise
- `nodegree`: 1 if the subject has earned no school degree, 0 otherwise
- `re74`: real earnings in 1974 (pre-treatment)
- `re75`: real earnings in 1975 (pre-treatment)
- `re78`: real earnings in 1978 (outcome)

If you want to brush up your knowledge on propensity scores and observational studies, we highly recommend Rosenbaum's excellent book on the ["Design of Observational Studies"](http://www.stewartschultz.com/statistics/books/Design%20of%20observational%20studies.pdf). Even just reading the first chapter (18 pages) will help you a lot.

## 1. A naive analysis

Compare the distribution of the outcome variable (`re78`) between the two groups, using plots and numbers.
To summarize and compare the distributions, you may use the techniques we discussed in lectures 4 ("Read the stats carefully") and 6 ("Data visualization").

What might a naive "researcher" conclude from this superficial analysis?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
import seaborn as sns
import networkx as nx
from networkx.algorithms import bipartite
from networkx.algorithms import max_weight_matching
%matplotlib inline

<span style="color:blue"> First, we load our data into a dataframe. 

In [None]:
lalonde = pd.read_csv('lalonde.csv', index_col=0)
lalonde.head()

<span style="color:blue">We create a *race* column that replaces the *black* and *hispan* columns, so that the information is available in a single feature: 0 for "black", 1 for "hispanic", 2 for "white" (neither black nor hispanic). We also replace the *nodegree* column with a *degree* column (inverting the boolean value) so that it is consistent with features such as *married*, where 1 means that the property holds; with *nodegree*, a value of 0 is a double negative ("no 'no degree'", i.e. a degree).

In [None]:
lalonde['race'] = 2 - lalonde['hispan'] - 2 * lalonde['black']  # black: 0, hispanic: 1, white: 2
lalonde['degree'] = 1 - lalonde['nodegree']  # invert the nodegree column

# remove unnecessary columns
del lalonde['nodegree']
del lalonde['black']
del lalonde['hispan']

lalonde = lalonde[['treat', 'age', 'educ', 'married', 'race', 'degree', 're74', 're75', 're78']]
lalonde.head()

<span style="color:blue">We also create two dataframes, *treat* and *no_treat*, to separate the subjects according to whether or not they completed the program.

In [None]:
treat = lalonde[lalonde['treat']==1] # only people who completed the program
no_treat = lalonde[lalonde['treat']==0] # only people who did not complete the program
treat.describe()

<span style="color:blue">Let's observe the difference in revenue in 1978 between people who completed the program and those who did not.
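
As a complement to the plot, the numeric side of the comparison can be sketched with a grouped summary. The small frame below is made-up toy data, not the LaLonde dataset; only the column names `treat` and `re78` are taken from it.

```python
import pandas as pd

# Toy data (assumption): invented earnings, NOT the LaLonde data
toy = pd.DataFrame({
    'treat': [1, 1, 1, 0, 0, 0],
    're78':  [0.0, 4000.0, 8000.0, 1000.0, 5000.0, 12000.0],
})

# Count, mean, std and quartiles of the outcome, one row per group
summary = toy.groupby('treat')['re78'].describe()
print(summary[['count', 'mean', '50%']])
```

Applied to the real data, the same call would be `lalonde.groupby('treat')['re78'].describe()`.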

In [None]:
vectors.shape

<span style="color:blue"> It seems we have a little over 18k articles and more than 134k features.

<span style="color:blue">We now split the data into training, testing, and validation sets. We use sklearn's builtin function to do so. We supply a predefined seed for the random number generator to ensure determinism across executions. We also ensure that the split is stratified, so that the proportion of categories remains similar in each subset.

In [None]:
# First, we split 80-20 to get the training set
training, remaining = train_test_split(list(range(0, len(newsgroups.data))), train_size=0.8, test_size=0.2, stratify=newsgroups.target, random_state=42)

# Next, we split 50-50 on the remaining 20%, so that testing and validation are both 10% of the original dataset
testing, validation = train_test_split(remaining, train_size=0.5, test_size=0.5, stratify=list(map(lambda idx: newsgroups.target[idx], remaining)), random_state=42)

# Sort
training.sort()
validation.sort()
testing.sort()

# Let's print out the sizes of each subset...
print('Training size: %d.\nTesting size: %d\nValidation size: %d.' % (len(training), len(testing), len(validation)))

<span style="color:blue">We create convenience functions to select the associated data, features, and observations from the indices in each subset.

In [None]:
def select_data(subset):
    return list(map(lambda idx: newsgroups.data[idx], subset))

def select_features(subset):
    return vectors[subset, :] # slice on rows, keep all columns

def select_observations(subset):
    return list(map(lambda idx: newsgroups.target[idx], subset))

<span style="color:blue">Using the above functions, we extract the features and observations for each subset of the original data.

In [None]:
training_data = select_data(training)
training_features = select_features(training)
training_observations = select_observations(training)
testing_data = select_data(testing)
testing_features = select_features(testing)
testing_observations = select_observations(testing)
validation_data = select_data(validation)
validation_features = select_features(validation)
validation_observations = select_observations(validation)

## 2. 
Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimators "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.

<span style="color:blue">Some imports...

In [None]:
from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

<span style="color:blue">We build a simple pipeline which consists of a `TfidfVectorizer` followed by a `RandomForestClassifier`. This will work as follows:

1. <span style="color:blue">Documents fed to the pipeline will be tokenized and converted into a token count matrix.
2. <span style="color:blue">The matrix will be converted into a TF-IDF representation (i.e. normalized term frequency times inverse document frequency).
3. <span style="color:blue">We feed the output to a random forest classifier which will fit the samples to a series of decision trees.

<span style="color:blue">Note: We are building the full pipeline here (i.e. including the first step which we did before) because one may want to perform a grid search for parameters of the TF-IDF vectorizer too (although this is not asked in the question).
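
To illustrate steps 1 and 2, the sketch below checks on a tiny invented corpus that a `TfidfVectorizer` with default settings gives the same matrix as a `CountVectorizer` followed by a `TfidfTransformer`. The three documents are assumptions made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption): three invented documents
docs = ['the cat sat', 'the dog sat down', 'a cat and a dog']

# Steps 1 and 2 done explicitly: token counts, then TF-IDF weighting
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)

# Both steps done at once
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate only matters if one wants to tune the counting and the weighting independently in the grid search.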

<span style="color:blue">We evaluate precision and recall with 10-fold cross-validation.

In [None]:
precision = cross_val_score(logistic, X, y, cv=10, scoring = "precision")
recall = cross_val_score(logistic, X, y, cv=10, scoring = "recall")

print("Precision: %0.2f (+/- %0.2f)" % (precision.mean(), precision.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (recall.mean(), recall.std() * 2))

<span style="color:blue"> Here we use the builtin method `predict_proba` to estimate the probability that each subject belongs to the treated or the untreated class. The probability of belonging to the positive class (treat == 1) is the propensity score.
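
A minimal, self-contained sketch of this step on synthetic data follows. The covariate and the treatment assignment are invented; the point is only to show how the column of the positive class can be looked up through `classes_` rather than hard-coded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): one invented covariate, treatment more likely for large x
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
t = (x[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(x, t)

# Look up the column of class 1 instead of hard-coding index 1
treated_col = list(model.classes_).index(1)
propensity = model.predict_proba(x)[:, treated_col]

print(propensity[:5])
```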

In [None]:
prediction = logistic.predict_proba(X) 
print(logistic.classes_) # to see which column corresponds to which class, the positive case corresponds to column 1
lalonde['propensy_score'] = pd.Series(prediction[:,1], index = lalonde.index)
lalonde.head()

## 4. Balancing the dataset via matching

Use the propensity scores to match each data point from the treated group with exactly one data point from the control group, while ensuring that each data point from the control group is matched with at most one data point from the treated group.
(Hint: you may explore the `networkx` package in Python for predefined matching functions.)

Your matching should maximize the similarity between matched subjects, as captured by their propensity scores.
In other words, the sum (over all matched pairs) of absolute propensity-score differences between the two matched subjects should be minimized.

After matching, you have as many treated as you have control subjects.
Compare the outcomes (`re78`) between the two groups (treated and control).

Also, compare again the feature-value distributions between the two groups, as you've done in part 2 above, but now only for the matched subjects.
What do you observe?
Are you closer to being able to draw valid conclusions now than you were before?

<span style="color:blue"> To match each data point from the treated group with exactly one data point from the control group, we use the `networkx` package. We build a bipartite graph with two node sets (control and treated), marked with the `bipartite` node attribute, whose edges only connect nodes from opposite sets. Nodes correspond to index labels of the lalonde dataset, and there is one edge for every possible (control, treated) pair. Each edge has a weight equal to `1 - abs(propensy_score(node1) - propensy_score(node2))`. The weights therefore lie in [0, 1], with 1 meaning identical propensity scores.
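
The same objective can also be written as a minimum-cost assignment problem. The sketch below is an alternative to the `networkx` approach used in this notebook, not the method applied here: it calls `scipy.optimize.linear_sum_assignment` on a matrix of absolute score differences. The scores are made-up values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy propensity scores (assumption): invented values
treated_scores = np.array([0.80, 0.55, 0.30])
control_scores = np.array([0.10, 0.35, 0.50, 0.75, 0.90])

# cost[i, j] = |score of treated i - score of control j|
cost = np.abs(treated_scores[:, None] - control_scores[None, :])

# Each treated unit gets exactly one control; each control is used at most once
treated_idx, control_idx = linear_sum_assignment(cost)
total = cost[treated_idx, control_idx].sum()

for i, j in zip(treated_idx, control_idx):
    print(f'treated {i} (score {treated_scores[i]:.2f}) <-> control {j} (score {control_scores[j]:.2f})')
print('total absolute difference:', round(total, 2))
```

This formulation minimizes the sum of absolute differences directly, so no `1 - ...` transformation of the weights is needed.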

In [None]:
G = nx.Graph()
G.add_nodes_from(lalonde.index[lalonde.treat == 0], bipartite = 0)
G.add_nodes_from(lalonde.index[lalonde.treat == 1], bipartite = 1)

# Add weighted edges. Since max_weight_matching maximizes the total weight, we use
# 1 - |difference of propensity scores|: 1 for identical scores, close to 0 for very different ones.
for node0, score0 in zip(lalonde.index[lalonde.treat == 0], lalonde.propensy_score[lalonde.treat == 0]):
    for node1, score1 in zip(lalonde.index[lalonde.treat == 1], lalonde.propensy_score[lalonde.treat == 1]):
        G.add_edge(node0, node1, weight=1 - abs(score0 - score1))

<span style="color:blue">We compute a maximum weight matching on the edge weights with the `max_weight_matching` function. A matching guarantees that no node occurs more than once. The `maxcardinality` parameter is set to `True` so that the result is the matching with maximum weight among all maximum-cardinality matchings.
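
To see why maximizing these weights minimizes the total score difference, write $e_i$ for the propensity score of subject $i$ and $n_t$ for the number of treated subjects (notation introduced here for illustration only). For a matching $M$ that pairs every treated subject with a control subject:

$$
\sum_{(i,j) \in M} w_{ij} \;=\; \sum_{(i,j) \in M} \bigl(1 - |e_i - e_j|\bigr) \;=\; n_t \;-\; \sum_{(i,j) \in M} |e_i - e_j|
$$

Assuming there are at least as many control subjects as treated subjects, every maximum-cardinality matching has exactly $n_t$ pairs, so the first term is a constant and maximizing the total weight is equivalent to minimizing the sum of absolute propensity-score differences.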

In [None]:
match = nx.max_weight_matching(G, maxcardinality = True)

<span style="color:blue"> Once we have the matching, we can also retrieve the two node sets of the bipartite graph (control and treated).

In [None]:
bottom_nodes, top_nodes = bipartite.sets(G)

<span style="color:blue"> We now create a new dataset containing only the matched elements. This will allow us to repeat the comparisons made in parts 1 and 2.

In [None]:
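# Compare outcome re78 between the two groups
# Older networkx releases return a dict of mates, newer ones a set of (u, v) pairs
matched_nodes = list(match.keys()) if isinstance(match, dict) else [node for pair in match for node in pair]
lalonde_matched = lalonde.loc[matched_nodes]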

ax = lalonde_matched.boxplot(column = ['re78'], by = ['treat'] ,figsize=(12,6))
ax.set_xlabel('Program completion')
ax.set_ylabel('Revenue in 1978')
ax.set_title('Revenue in 1978 as a function of program completion')

In [None]:
for features in ['age', 'educ', 're74', 're75']:
    ax = lalonde_matched.boxplot(column = [features], by = ['treat'] ,figsize=(12,6))
    ax.set_xlabel('Program completion')
    ax.set_ylabel(features)
    ax.set_title(features+ ' as a function of program completion')

In [None]:
treated = lalonde_matched[lalonde_matched['treat'] == 1]
no_treated = lalonde_matched[lalonde_matched['treat'] == 0]

categorical_features = ['race', 'married', 'degree']
legends = [['black', 'hispanic', 'white'], ['not married', 'married'], ['no degree', 'degree']]

# One pie chart per categorical feature, first for the treatment group...
for feature, labels in zip(categorical_features, legends):
    treated.groupby(feature).count().plot.pie(y='treat', autopct='%.2f%%').legend(labels=labels, loc='best')
    plt.title(feature + ' distribution in subjects who completed the program')

# ...then for the control group
for feature, labels in zip(categorical_features, legends):
    no_treated.groupby(feature).count().plot.pie(y='treat', autopct='%.2f%%').legend(labels=labels, loc='best')
    plt.title(feature + ' distribution in subjects who did not complete the program')

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.countplot(x="race", hue="treat", data=lalonde_matched)

<span style="color:blue">We observe that with the new dataset, the median of *re78* is slightly lower for the treatment group, and the distributions are similar. The age distribution changed for the control group: the median age went from 25 to 21 years. *educ* still has a similar distribution in both groups, and *re74* is still higher for the control group, although its spread became narrower and its median decreased. Similar observations can be made for *re75*.

<span style="color:blue">Marital status and degree attainment still show differences, but are now more similar between groups. There are still huge differences in the race distribution between the two groups. We can decrease the bias in this dataset by keeping only matched pairs of subjects of the same race.
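
To put a number on "more similar", one common balance diagnostic is the standardized mean difference between the two groups. It is not used elsewhere in this notebook, so treat it as an optional addition; the helper name `standardized_mean_difference` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df, feature, group_col='treat'):
    """(mean_treated - mean_control) / pooled standard deviation. Hypothetical helper."""
    treated = df.loc[df[group_col] == 1, feature]
    control = df.loc[df[group_col] == 0, feature]
    pooled_std = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_std

# Toy data (assumption): invented ages, NOT the LaLonde data
toy = pd.DataFrame({
    'treat': [1, 1, 1, 1, 0, 0, 0, 0],
    'age':   [22, 24, 26, 28, 30, 32, 34, 36],
})

smd = standardized_mean_difference(toy, 'age')
print(round(smd, 2))
```

Values close to zero indicate good balance on that feature. Computing the statistic before and after matching, for each feature, gives a compact table to accompany the plots.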

## 5. Balancing the groups further

Based on your comparison of feature-value distributions from part 4, are you fully satisfied with your matching?
Would you say your dataset is sufficiently balanced?
If not, in what ways could the "balanced" dataset you have obtained still not allow you to draw valid conclusions?

Improve your matching by explicitly making sure that you match only subjects that have the same value for the problematic feature.
Argue with numbers and plots that the two groups (treated and control) are now better balanced than after part 4.

<span style="color:blue">The most problematic feature is the *race* attribute. It can be improved by selecting only the matched elements that have the same race value.
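
An alternative to filtering pairs after the fact is to forbid cross-race pairs during matching, so that fewer treated subjects are discarded. The sketch below is not the approach taken in this notebook: it runs a separate minimum-cost assignment per race value with `scipy.optimize.linear_sum_assignment`. The helper name `match_within_groups` and the toy data are assumptions; the column names follow the notebook.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

def match_within_groups(df, group_col='race', score_col='propensy_score'):
    """Return (treated_index, control_index) pairs matched within each group. Hypothetical helper."""
    pairs = []
    for _, block in df.groupby(group_col):
        treated = block[block['treat'] == 1]
        control = block[block['treat'] == 0]
        if treated.empty or control.empty:
            continue  # nobody to match with in this group
        cost = np.abs(treated[score_col].to_numpy()[:, None] - control[score_col].to_numpy()[None, :])
        rows, cols = linear_sum_assignment(cost)
        pairs += [(treated.index[r], control.index[c]) for r, c in zip(rows, cols)]
    return pairs

# Toy data (assumption): invented scores, NOT the LaLonde data
toy = pd.DataFrame({
    'treat':          [1,   1,   0,   0,   0,   1],
    'race':           [0,   0,   0,   0,   2,   2],
    'propensy_score': [0.8, 0.6, 0.7, 0.5, 0.3, 0.4],
}, index=['a', 'b', 'c', 'd', 'e', 'f'])

pairs = match_within_groups(toy)
print(pairs)
```

If a group contains more treated than control subjects, some treated subjects in that group remain unmatched.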

In [None]:
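lalonde_cleaned_index = []
seen = set()

# `match` is a dict of mates in older networkx releases and a set of (u, v) pairs in newer ones
pairs = match.items() if isinstance(match, dict) else match

for u, v in pairs:
    if u in seen or v in seen:
        continue  # the dict form lists every pair twice, once per direction
    seen.update((u, v))
    # keep the pair only if both matched subjects have the same race
    if lalonde_matched.loc[u, 'race'] == lalonde_matched.loc[v, 'race']:
        lalonde_cleaned_index.extend([u, v])

lalonde_cleaned = lalonde_matched.loc[lalonde_cleaned_index]
lalonde_cleaned.head()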

<span style="color:blue"> Let's see what we now have...

In [None]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier()),
])

<span style="color:blue"> We define the parameters and value ranges we are interested in for the grid search.

<span style="color:blue"> For this assignment, we use the following:
* <span style="color:blue">`n_estimators`: ranges from $2^0$ to $2^{10}$.
* <span style="color:blue">`max_depth`: ranges from 10 to 200 in increments of 10.

<span style="color:blue">These values represent a trade-off between grid search execution time and quality of the parameters.
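
To get a feel for the execution-time side of this trade-off, the size of the grid can be counted before launching the search. The sketch below uses `sklearn.model_selection.ParameterGrid` with the ranges listed above and assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Same value ranges as described above
parameters = {
    'clf__n_estimators': [2**n for n in range(11)],   # 1, 2, 4, ..., 1024
    'clf__max_depth': list(range(10, 210, 10)),       # 10, 20, ..., 200
}

n_combinations = len(ParameterGrid(parameters))
n_folds = 5  # assumption: number of cross-validation folds
print(n_combinations, 'combinations,', n_combinations * n_folds, 'model fits')
```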

In [None]:
parameters = {
    'clf__n_estimators': list(map(lambda n: 2**n, range(0,11))),
    'clf__max_depth': list(range(10,210,10))
}

<span style="color:blue">We perform the grid search with 5-fold cross-validation...

In [None]:
grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)

print('Starting grid search...')
start = time()
grid_search.fit(validation_data, validation_observations)
end = time()
print('Grid search completed in %0.3fs' % (end - start))

<span style="color:blue">... and the best score we have achieved is:

In [None]:
grid_search.best_score_

<span style="color:blue">Using the following parameter values:

In [None]:
best_parameters = grid_search.best_estimator_.get_params()

print('''Best values:
\tn_estimators = %d
\tmax_depth = %d''' % (best_parameters['clf__n_estimators'], best_parameters['clf__max_depth']))

<span style="color:blue">Now that we have the parameters fine-tuned, we apply them to the model.

In [None]:
pipeline.set_params(clf__n_estimators=best_parameters['clf__n_estimators'], clf__max_depth=best_parameters['clf__max_depth'])

<span style="color:blue">Let's train the pipeline using our training subset...

In [None]:
pipeline.fit(training_data, training_observations).score(training_data, training_observations)

<span style="color:blue">And now let's see how well we do on the testing dataset.

In [None]:
fig, ax1 = plt.subplots(figsize=(12,8))
ax1 = sns.countplot(x="race", hue="treat", data=lalonde_cleaned)

fig, ax2 = plt.subplots(figsize=(12,8))
ax2 = sns.countplot(x="married", hue="treat", data=lalonde_cleaned)

fig, ax3 = plt.subplots(figsize=(12,8))
ax3 = sns.countplot(x="degree", hue="treat", data=lalonde_cleaned)

<span style="color:blue"> Looking at the graphs above, we can see that the two populations are very close with respect to race, degree, and marital status. We now plot the remaining features again.

In [None]:
pipeline.score(testing_data, testing_observations)

<span style="color:blue">We get an accuracy of 66%. Cool :-)

<span style="color:blue">We now plot the confusion matrix of our pipeline.

In [None]:
import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

np.set_printoptions(precision=2)

<span style="color:blue">Here is a helper function to draw the plot. Adapted from [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html).

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

<span style="color:blue">Now we compute the confusion matrix. Simply use the trained model to predict classes for the testing data and then compare with the actual ground truth.
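
A small sketch of this computation on invented labels, including the row normalization applied by the plotting helper. The label vectors are assumptions made up for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): three invented classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Divide each row by its total so that every row sums to 1
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_normalized.round(2))
```

After row normalization, the diagonal entry of each row is the fraction of that class that was classified correctly.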

In [None]:
testing_predicted = pipeline.predict(testing_data)
cnf_matrix = confusion_matrix(testing_observations, testing_predicted)

<span style="color:blue">Here comes the plot...

In [None]:
plt.figure(figsize=(12,12))
plot_confusion_matrix(cnf_matrix, classes=categories, normalize=True, title='Confusion matrix')
plt.show()

<span style="color:blue">The diagonal shows the proportion of correctly classified elements. As we can see, this is pretty good, indicating that the classes of many elements were correctly predicted. Nevertheless, there seems to be a non-trivial amount of confusion between a few classes. We might be able to improve this if we had more data, if we could fine-tune the parameters even better, or if we could refine the features. Unfortunately, this is beyond the scope of this assignment.

<span style="color:blue">Finally, we take a closer look at the important features. We start with a simple scatter plot of all features along with their importance.

In [None]:
for feature in ['age', 'educ', 're74', 're75']:
    ax = lalonde_cleaned.boxplot(column=[feature], by=['treat'], figsize=(12,6))
    ax.set_xlabel('Program completion')
    ax.set_ylabel(feature)
    ax.set_title(feature + ' as a function of program completion')

<span style="color:blue">The two populations are now similar with respect to race, and still fairly similar with respect to marital status and degree attainment. The current dataset is less biased than before.

## 6. A less naive analysis

Compare the outcomes (`re78`) between treated and control subjects, as you've done in part 1, but now only for the matched dataset you've obtained from part 5.
What do you conclude about the effectiveness of the job training program?

In [None]:
# Original boxplots as a reminder 
ax = lalonde.boxplot(column = ['re78'], by = ['treat'] ,figsize=(12,6))
ax.set_xlabel('Program completion')
ax.set_ylabel('Revenue in 1978')
ax.set_title('Revenue in 1978 as a function of program completion, biased dataset')

# Compare outcome re78 between the two groups when the race is set to be the same
ax = lalonde_cleaned.boxplot(column = ['re78'], by = ['treat'] ,figsize=(12,6))
ax.set_xlabel('Program completion')
ax.set_ylabel('Revenue in 1978')
ax.set_title('Revenue in 1978 as a function of program completion, clean dataset')

lalonde_cleaned.describe()

<span style="color:blue">Both distributions are narrower (mind the scale!), and the median real earnings of the treatment group are now higher than those of the control group, even though both medians are lower than in the original dataset. The first quartile (25th percentile) is nonzero for the treatment group, meaning that more of its members now have earnings, which was not the case in 1974 and 1975, before the training program. We can now conclude with reasonable certainty that the program had a positive effect on real earnings.

# Question 2: Applied ML

We are going to build a classifier of news to directly assign them to 20 news categories. Note that the pipeline that you will build in this exercise could be of great help during your project if you plan to work with text!

## 1.

Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more importance to terms that are more specific to the considered articles (TF) but reduces the importance of terms that are very frequent in the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a testing and a validation set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).

<span style="color:blue"> Let's start with a bunch of imports.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from pprint import pprint

<span style="color:blue"> We get the data, stripping headers, footers, and quotes.

In [None]:
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
categories = list(newsgroups.target_names)

<span style="color:blue"> We then compute the TF-IDF features for all articles.

In [None]:
forest = pipeline.steps[1][1]
importances = forest.feature_importances_
plt.figure(figsize=(12,12))
plt.scatter(range(len(importances)), importances)
plt.show()

<span style="color:blue">As we can see, a large majority of the features have very low importance or no importance at all. There are very few outliers in the plot.

<span style="color:blue">Let's see how many features have a score $\ge 0.001$...
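
A minimal sketch of that count follows. The array below is an invented stand-in for `forest.feature_importances_`, so the numbers are illustrative only.

```python
import numpy as np

# Toy importances (assumption): invented values standing in for forest.feature_importances_
importances = np.array([0.0, 0.0004, 0.0012, 0.25, 0.0, 0.003, 0.0009])

threshold = 0.001
n_important = int(np.sum(importances >= threshold))

# Indices of the features above the threshold, most important first
order = np.argsort(importances)[::-1]
top = [int(i) for i in order if importances[i] >= threshold]

print(n_important, top)
```

With the real pipeline, the indices in `top` could then be mapped back to terms through the vocabulary of the fitted vectorizer.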

In [None]:
ax = lalonde.boxplot(column=['re78'], by=['treat'], figsize=(12,6))
ax.set_xlabel('Program completion')
ax.set_ylabel('RE in 1978')
ax.set_title('Real earnings in 1978 as a function of program completion')

<span style="color:blue">At first glance, we can observe a similar distribution and an equivalent median between the two groups. A naive interpretation would be that the treatment had no effect on the outcome (real earnings in 1978).

## 2. A closer look at the data

You're not naive, of course (and even if you are, you've learned certain things in ADA), so you aren't content with a superficial analysis such as the above.
You're aware of the dangers of observational studies, so you take a closer look at the data before jumping to conclusions.

For each feature in the dataset, compare its distribution in the treated group with its distribution in the control group, using plots and numbers.
As above, you may use the techniques we discussed in class for summarizing and comparing the distributions.

What do you observe?
Describe what your observations mean for the conclusions drawn by the naive "researcher" from his superficial analysis.

<span style="color:blue">We display several boxplots, one for each feature, grouped by the *treat* feature.

In [None]:
for feature in ['age', 'educ', 're74', 're75']:
    ax = lalonde.boxplot(column=[feature], by=['treat'], figsize=(12,6))
    ax.set_xlabel('Program completion')
    ax.set_ylabel(feature)
    ax.set_title(feature + ' as a function of program completion')

<span style="color:blue">We also compute several pie charts for the ternary/binary features *race*, *married* and *degree*.

In [None]:
categorical_features = ['race', 'married', 'degree']
legends = [['black', 'hispanic', 'white'], ['not married', 'married'], ['no degree', 'degree']]

# One pie chart per categorical feature, first for the control group...
for feature, labels in zip(categorical_features, legends):
    no_treat.groupby(feature).count().plot.pie(y='treat', autopct='%.2f%%').legend(labels=labels, loc='best')
    plt.title(feature + ' distribution in subjects who did not complete the program')

# ...then for the treatment group
for feature, labels in zip(categorical_features, legends):
    treat.groupby(feature).count().plot.pie(y='treat', autopct='%.2f%%').legend(labels=labels, loc='best')
    plt.title(feature + ' distribution in subjects who completed the program')

<span style="color:blue">Regarding age, we observe an equivalent median, with a wider spread for the control group. The number of years of education shows the same median value, with a similar distribution across groups.

<span style="color:blue">However, there is a clear difference for real earnings in 1974 and 1975. The median of *re74* is 0 for the treatment group and higher for the control group, which probably means that many people in the treatment group originally had no job. The treatment group also has a lower third quartile (75th percentile): real earnings were originally much lower for people who ended up following the program years later. The same observation can be made for *re75*, even though the treatment group shows an increased third quartile while the distribution for the control group shrank. The median is still 0 for the treatment group and has decreased for the control group.

<span style="color:blue">The control group is unevenly distributed across races: most people are white (65%). Marital status is evenly distributed, and 60% of people have no degree. The treatment group is even more unevenly distributed across races, with 84% of people being black; marital status and degree attainment differ as well.

<span style="color:blue">As a first observation, we could say that the program enabled people to get a job and/or increase their real earnings, but the two groups are too different and biased to draw conclusions with any certainty.

## 3. A propensity score model

Use logistic regression to estimate propensity scores for all points in the dataset.
You may use `sklearn` to fit the logistic regression model and apply it to each data point to obtain propensity scores:

```python
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
```

Recall that the propensity score of a data point represents its probability of receiving the treatment, based on its pre-treatment features (in this case, age, education, pre-treatment income, etc.).
To brush up on propensity scores, you may read chapter 3.3 of the above-cited book by Rosenbaum or [this article](https://drive.google.com/file/d/0B4jctQY-uqhzTlpBaTBJRTJFVFE/view).

Note: you do not need a train/test split here. Train and apply the model on the entire dataset. If you're wondering why this is the right thing to do in this situation, recall that the propensity score model is not used in order to make predictions about unseen data. Its sole purpose is to balance the dataset across treatment groups.
(See p. 74 of Rosenbaum's book for an explanation why slight overfitting is even good for propensity scores.
If you want even more information, read [this article](https://drive.google.com/file/d/0B4jctQY-uqhzTlpBaTBJRTJFVFE/view).)

<span style="color:blue">We create and train the logistic regression model. This model will enable us to obtain propensity scores for all points in the dataset.

In [None]:
# Pre-treatment features only: the outcome re78 must not be used to predict treatment
feature_cols = ['age', 'educ', 'married', 'race', 'degree', 're74', 're75']
X = lalonde[feature_cols]
y = lalonde.treat

logistic = linear_model.LogisticRegression()
logistic.fit(X,y) # train the model

In [None]:
vectors = TfidfVectorizer().fit_transform(newsgroups.data)

<span style="color:blue"> Let's see what that looks like...