
Commit 0c9129c

course3
1 parent ec9a4b0 commit 0c9129c

15 files changed: +32775 -0 lines changed

Applied Machine Learning in Python/Assignment 1.ipynb

Lines changed: 2346 additions & 0 deletions
Large diffs are not rendered by default.

Applied Machine Learning in Python/Assignment 2.ipynb

Lines changed: 513 additions & 0 deletions
Large diffs are not rendered by default.
Applied Machine Learning in Python/Assignment 3.ipynb

Lines changed: 350 additions & 0 deletions
@@ -0,0 +1,350 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 3 - Evaluation\n",
"\n",
"In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).\n",
" \n",
"Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. \n",
" \n",
"The target is stored in the `class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1\n",
"Import the data from `fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?\n",
"\n",
"*This function should return a float between 0 and 1.* "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def answer_one():\n",
"    # `df` is loaded from fraud_data.csv in the cell below, so run that cell first.\n",
"    # The class column is 0/1, so its mean is the fraction of fraudulent transactions.\n",
"    return df.iloc[:, -1].mean()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Use X_train, X_test, y_train, y_test for all of the following questions\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"df = pd.read_csv('fraud_data.csv')\n",
"\n",
"X = df.iloc[:,:-1]\n",
"y = df.iloc[:,-1]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
]
},
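{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Optional sanity check (not part of the graded assignment): the split above is not\n",
"# stratified, so the fraud rate can differ slightly between the two splits. A quick\n",
"# look at the class balance on each side:\n",
"print('fraud rate in y_train: {:.4f}'.format(y_train.mean()))\n",
"print('fraud rate in y_test:  {:.4f}'.format(y_test.mean()))"
]
},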
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 2\n",
"\n",
"Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?\n",
"\n",
"*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def answer_two():\n",
"    from sklearn.dummy import DummyClassifier\n",
"    from sklearn.metrics import recall_score\n",
"\n",
"    # Always predict the majority class of the training data\n",
"    clf = DummyClassifier(strategy='most_frequent', random_state=0)\n",
"    clf.fit(X_train, y_train)\n",
"    acc = clf.score(X_test, y_test)\n",
"    # Recall of the positive (fraud) class; average='binary' is the default\n",
"    recall = recall_score(y_test, clf.predict(X_test))\n",
"\n",
"    return (acc, recall)"
]
},
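{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Optional illustration (not required by the assignment): a minimal sketch showing why\n",
"# the majority-class dummy reaches high accuracy but zero recall -- it never predicts\n",
"# the positive (fraud) class, so every fraud case becomes a false negative.\n",
"from sklearn.dummy import DummyClassifier\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"dummy = DummyClassifier(strategy='most_frequent', random_state=0).fit(X_train, y_train)\n",
"# Rows are the true classes (0, 1); columns are the predicted classes (0, 1)\n",
"print(confusion_matrix(y_test, dummy.predict(X_test)))"
]
},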
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 3\n",
"\n",
"Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?\n",
"\n",
"*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def answer_three():\n",
"    from sklearn.metrics import recall_score, precision_score\n",
"    from sklearn.svm import SVC\n",
"\n",
"    # SVC with default parameters\n",
"    clf = SVC()\n",
"    clf.fit(X_train, y_train)\n",
"    y_pred = clf.predict(X_test)\n",
"\n",
"    acc = clf.score(X_test, y_test)\n",
"    recall = recall_score(y_test, y_pred)\n",
"    precision = precision_score(y_test, y_pred)\n",
"\n",
"    return (acc, recall, precision)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 4\n",
"\n",
"Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.\n",
"\n",
"*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def answer_four():\n",
"    from sklearn.metrics import confusion_matrix\n",
"    from sklearn.svm import SVC\n",
"\n",
"    clf = SVC(C=1e9, gamma=1e-07)\n",
"    clf.fit(X_train, y_train)\n",
"    # Classify as fraud wherever the decision function exceeds the -220 threshold\n",
"    scores = clf.decision_function(X_test)\n",
"    y_pred = (scores > -220).astype(int)\n",
"\n",
"    return confusion_matrix(y_test, y_pred)"
]
},
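{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Optional illustration (not required by the assignment): unpack the confusion matrix\n",
"# returned above. scikit-learn lays it out as [[TN, FP], [FN, TP]] for labels (0, 1),\n",
"# so lowering the decision threshold from 0 to -220 trades extra false positives for\n",
"# fewer false negatives.\n",
"cm = answer_four()\n",
"tn, fp, fn, tp = cm.ravel()\n",
"print('TN={}, FP={}, FN={}, TP={}'.format(tn, fp, fn, tp))"
]
},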
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 5\n",
"\n",
"Train a logistic regression classifier with default parameters using X_train and y_train.\n",
"\n",
"For the logistic regression classifier, create a precision-recall curve and an ROC curve using y_test and the probability estimates for X_test (the probability it is fraud).\n",
"\n",
"Looking at the precision-recall curve, what is the recall when the precision is `0.75`?\n",
"\n",
"Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?\n",
"\n",
"*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def answer_five():\n",
"    from sklearn.linear_model import LogisticRegression\n",
"    from sklearn.metrics import precision_recall_curve\n",
"    from sklearn.metrics import roc_curve\n",
"    #import matplotlib.pyplot as plt\n",
"    #%matplotlib inline\n",
"\n",
"    clf = LogisticRegression(n_jobs=-1)\n",
"    clf.fit(X_train, y_train)\n",
"    y_proba = clf.predict_proba(X_test)[:, 1]\n",
"    precision, recall, _ = precision_recall_curve(y_test, y_proba)\n",
"    fpr, tpr, _ = roc_curve(y_test, y_proba)\n",
"\n",
"    #plt.plot(fpr, tpr)\n",
"    #plt.xlabel('fpr')\n",
"    #plt.ylabel('tpr')\n",
"    #plt.show()\n",
"\n",
"    # Values read off the two curves: recall is about 0.8 where precision is 0.75,\n",
"    # and the true positive rate is about 0.9 where the false positive rate is 0.16\n",
"    return (0.8, 0.9)"
]
},
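{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Optional cross-check (not required by the assignment): instead of reading the two\n",
"# values off the plots, look them up from the curve arrays directly. A minimal sketch\n",
"# that refits the same default LogisticRegression used in answer_five; `np` comes from\n",
"# the import cell at the top of the notebook.\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import precision_recall_curve, roc_curve\n",
"\n",
"lr = LogisticRegression(n_jobs=-1).fit(X_train, y_train)\n",
"proba = lr.predict_proba(X_test)[:, 1]\n",
"precision, recall, _ = precision_recall_curve(y_test, proba)\n",
"fpr, tpr, _ = roc_curve(y_test, proba)\n",
"\n",
"# Recall at the curve point where precision is closest to 0.75\n",
"print('recall near precision 0.75:', recall[np.argmin(np.abs(precision - 0.75))])\n",
"# True positive rate at the curve point where FPR is closest to 0.16\n",
"print('tpr near fpr 0.16:', tpr[np.argmin(np.abs(fpr - 0.16))])"
]
},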
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 6\n",
"\n",
"Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross-validation.\n",
"\n",
"`'penalty': ['l1', 'l2']`\n",
"\n",
"`'C':[0.01, 0.1, 1, 10, 100]`\n",
"\n",
"From `.cv_results_`, create an array of the mean test scores of each parameter combination, i.e.\n",
"\n",
"| \t| `l1` \t| `l2` \t|\n",
"|:----:\t|----\t|----\t|\n",
"| **`0.01`** \t| ?\t| ? \t|\n",
"| **`0.1`** \t| ?\t| ? \t|\n",
"| **`1`** \t| ?\t| ? \t|\n",
"| **`10`** \t| ?\t| ? \t|\n",
"| **`100`** \t| ?\t| ? \t|\n",
"\n",
"<br>\n",
"\n",
"*This function should return a 5 by 2 numpy array with 10 floats.* \n",
"\n",
"*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array.*"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def answer_six():\n",
"    from sklearn.model_selection import GridSearchCV\n",
"    from sklearn.linear_model import LogisticRegression\n",
"\n",
"    params = {'penalty': ['l1', 'l2'],\n",
"              'C': [0.01, 0.1, 1, 10, 100]}\n",
"    clf = GridSearchCV(LogisticRegression(n_jobs=-1), params, scoring='recall', n_jobs=-1)\n",
"    clf.fit(X_train, y_train)\n",
"    # The grid is enumerated with 'penalty' varying fastest within each 'C', so\n",
"    # reshaping to (5, 2) gives rows indexed by C and columns by penalty\n",
"    ans = clf.cv_results_['mean_test_score'].reshape(5, 2)\n",
"    return ans"
]
},
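{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Optional cross-check (not required by the assignment): the reshape(5, 2) in\n",
"# answer_six assumes the grid is enumerated with 'penalty' varying fastest within\n",
"# each 'C'. This minimal sketch reruns the same grid and prints cv_results_['params']\n",
"# next to the mean test scores so that ordering can be checked directly.\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"grid = GridSearchCV(LogisticRegression(n_jobs=-1),\n",
"                    {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},\n",
"                    scoring='recall', n_jobs=-1)\n",
"grid.fit(X_train, y_train)\n",
"for p, s in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):\n",
"    print(p, round(s, 4))"
]
},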
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Use the following function to help visualize results from the grid search\n",
"def GridSearch_Heatmap(scores):\n",
"    %matplotlib notebook\n",
"    import seaborn as sns\n",
"    import matplotlib.pyplot as plt\n",
"    plt.figure()\n",
"    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])\n",
"    plt.yticks(rotation=0);\n",
"\n",
"#GridSearch_Heatmap(answer_six())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"coursera": {
"course_slug": "python-machine-learning",
"graded_item_id": "5yX9Z",
"launcher_item_id": "eqnV3",
"part_id": "Msnj0"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
