# Test Dataset Scraping and Cleaning

In this notebook, we pull the test dataset from the New York Times interviews of the Democratic candidates. We will also pull a Trump tweet dataset as an example of views from the other end of the political spectrum. After running scrape_test, we manually delete the rows at the beginning and end of the output CSV that contain page formatting rather than interview text. The next function, address_short_strings, appends any string shorter than 50 characters to the previous row so that it keeps its context. The result is written to candidatename_cleaned.csv (the raw data is in candidatename.csv).
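The short-string merging rule described above can be sketched on a plain Python list, without any CSV files. This is only an illustration: the helper name merge_short_strings, the min_len parameter, and the sample rows are made up for this example and are not part of the notebook's data.

```python
def merge_short_strings(rows, min_len=50):
    """Append each row shorter than min_len to the most recently kept row.

    The first row is always kept, mirroring the `index > 0` check
    in address_short_strings.
    """
    merged = []
    for text in rows:
        if merged and len(text) < min_len:
            # Short fragment: attach it to the previous kept row.
            merged[-1] = merged[-1] + " " + text
        else:
            # Long enough (or the very first row): start a new record.
            merged.append(text)
    return merged


# Made-up sample rows, not taken from the interviews.
sample_rows = [
    "This first row is long enough to stand on its own as a record.",
    "Right.",
    "Thank you.",
    "This second long row starts a new record because it exceeds the threshold.",
]
merged_rows = merge_short_strings(sample_rows)
for row in merged_rows:
    print(row)
```

Consecutive short rows all attach to the same preceding long row, which matches the repeated second index in the output of address_short_strings (for example `73 71` followed by `74 71`).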

Note that we may consolidate all rows into a single record for testing; that is an option we can choose at testing time. Alternatively, we may run each record separately and post-process the predictions, taking the maximum value of each label over all records for a given candidate (since not every row contains information about every issue). This may help ensure that models such as RNNs do not "forget" topics from earlier sections of the passage.
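A minimal sketch of the per-record post-processing option, assuming each record yields one score per label. The function name max_label_per_candidate, the label names, and the scores are hypothetical placeholders; no model output from this project is shown here.

```python
from collections import defaultdict


def max_label_per_candidate(record_scores):
    """Take the maximum score of each label over all records of a candidate.

    record_scores: iterable of (candidate, {label: score}) pairs,
    one pair per record (row) of that candidate's interview.
    """
    best = defaultdict(dict)
    for candidate, scores in record_scores:
        for label, score in scores.items():
            previous = best[candidate].get(label, float("-inf"))
            best[candidate][label] = max(score, previous)
    return {candidate: dict(labels) for candidate, labels in best.items()}


# Hypothetical per-record scores for two illustrative labels.
records = [
    ("Biden", {"healthcare": 0.2, "climate": 0.7}),
    ("Biden", {"healthcare": 0.9, "climate": 0.1}),
    ("Warren", {"healthcare": 0.4, "climate": 0.3}),
]
per_candidate = max_label_per_candidate(records)
print(per_candidate)
```

Taking the maximum means a label counts for a candidate as soon as any single record scores highly on it, so records that never mention an issue do not pull that label down.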

In [18]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

def scrape_test(candidate, url):
    headers = requests.utils.default_headers()
    headers.update({ 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
    
    texts = []
    # Pass headers by keyword: the second positional argument of requests.get is params.
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    
    s = soup.get_text()
    all_strings = s.split("\n")
    body = [string for string in all_strings if len(string) > 0]
    texts.extend(body)
    
    
    df = pd.DataFrame(texts, columns = ['text']) 
    print(df.head())
    df.to_csv(candidate + ".csv")

In [19]:
scrape_test("Biden", "https://www.nytimes.com/interactive/2020/01/17/opinion/joe-biden-nytimes-interview.html")

                                                text
0  Opinion | Joe Biden Says Age Is Just a Number ...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                                        // 20.546kB
4                  window.viHeadScriptSize = 20.546;


In [20]:
scrape_test("Sanders", "https://www.nytimes.com/interactive/2020/01/13/opinion/bernie-sanders-nytimes-interview.html")

                                                text
0  Opinion | Bernie Sanders Wants to Change Your ...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                                        // 20.546kB
4                  window.viHeadScriptSize = 20.546;


In [21]:
scrape_test("Buttigieg", "https://www.nytimes.com/interactive/2020/01/16/opinion/pete-buttigieg-nytimes-interview.html")


                                                text
0  Opinion | Pete Buttigieg Says He’s More Than a...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                                        // 20.546kB
4                  window.viHeadScriptSize = 20.546;


In [22]:
scrape_test("Klobuchar", "https://www.nytimes.com/interactive/2020/01/15/opinion/amy-klobuchar-nytimes-interview.html")


                                                text
0  Opinion | Amy Klobuchar on Plans vs. Pipe Drea...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                                        // 20.546kB
4                  window.viHeadScriptSize = 20.546;


In [23]:
scrape_test("Yang", "https://www.nytimes.com/interactive/2020/01/15/opinion/andrew-yang-nytimes-interview.html")


                                                text
0  Opinion | Andrew Yang Is Listening - The New Y...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                                        // 20.546kB
4                  window.viHeadScriptSize = 20.546;


In [24]:
scrape_test("Warren", "https://www.nytimes.com/interactive/2020/01/14/opinion/elizabeth-warren-nytimes-interview.html")


                                                text
0  Opinion | Elizabeth Warren Is Ready for a Figh...
1                  [data-timezone] { display: none }
2  .css-6n7j50{display:inline;}.css-1kj7lfb{displ...
3                                        // 20.546kB
4                  window.viHeadScriptSize = 20.546;


In [118]:
def address_short_strings(candidate):
    # Read the manually trimmed raw scrape for this candidate.
    df = pd.read_csv(candidate + ".csv")
    last_valid_row = 0
    for index, row in df.iterrows():
        if index > 0 and len(row.text) < 50:
            # Short fragment: append it to the last long row, then drop it.
            print(index, last_valid_row)
            df.at[last_valid_row, 'text'] = df.at[last_valid_row, 'text'] + ' ' + row.text
            df.drop(index, inplace=True)
        else:
            last_valid_row = index
    # Keep only the text column (drops the index column saved by scrape_test).
    df = df[["text"]]
    df.to_csv(candidate + "_cleaned.csv")
    return df

In [119]:
address_short_strings("Biden")

4 3
13 12
17 16
19 18
30 29
40 39
41 39
54 53
57 56
62 61
67 66
70 69
72 71
73 71
74 71
94 93
98 97
102 101
103 101
104 101
106 105
109 108
110 108
111 108
118 117
119 117
123 122
124 122
125 122
126 122
134 133
135 133
137 136
138 136
139 136
143 142
144 142
145 142
155 154
156 154
157 154
159 158
161 160
168 167
175 174
176 174
181 180
182 180
187 186
190 189
193 192
194 192
195 192
199 198
200 198
201 198
204 203
209 208
212 211
213 211
220 219
233 232
242 241
245 244
246 244
249 248
257 256
259 258
261 260
262 260
266 265
267 265
268 265
269 265
286 285
287 285
288 285
292 291
297 296
306 305
307 305
310 309
311 309
316 315
328 327
336 335
337 335
338 335
347 346
351 350
354 353
355 353
358 357
367 366
368 366
378 377
380 379
385 384
390 389
394 393
399 398
403 402
406 405
417 416
420 419
443 442
447 446
458 457
465 464
466 464
471 470
473 472
475 474
476 474
489 488
525 524
531 530
532 530
533 530
536 535
539 538
540 538
541 538
543 542
544 542


Unnamed: 0,text
0,Opinion | Joe Biden Says Age Is Just a Number ...
1,He also discussed the “creeps” in Silicon Vall...
2,"Here is a transcript, with annotations in blue..."
3,"Kathleen Kingsbury: So Mr. Vice President, we’..."
5,"KK: We have a lot of questions to get through,..."
6,"In the October Democratic debate, Mr. Biden wa..."
7,"Look, I fought corruption when I was in Ukrain..."
8,The Times editorial board wrote in 2015: “It s...
9,He’s acknowledged that he thought it was a mis...
10,KK: Would you be in favor of a law banning the...


In [120]:
address_short_strings("Sanders")
address_short_strings("Buttigieg")
address_short_strings("Klobuchar")
address_short_strings("Yang")
address_short_strings("Warren")

7 6
19 18
21 20
23 22
31 30
33 32
35 34
36 34
38 37
49 48
59 58
63 62
65 64
68 67
73 72
83 82
112 111
115 114
116 114
117 114
122 121
124 123
135 134
139 138
144 143
147 146
150 149
151 149
155 154
156 154
158 157
163 162
168 167
172 171
178 177
179 177
184 183
185 183
186 183
191 190
192 190
211 210
215 214
231 230
232 230
236 235
252 251
253 251
269 268
276 275
283 282
285 284
286 284
287 284
288 284
289 284
291 290
299 298
304 303
308 307
309 307
310 307
311 307
314 313
315 313
322 321
323 321
324 321
325 321
339 338
350 349
351 349
357 356
358 356
359 356
365 364
369 368
372 371
384 383
385 383
387 386
392 391
394 393
395 393
397 396
398 396
399 396
400 396
402 401
409 408
3 2
9 8
10 8
18 17
20 19
22 21
65 64
67 66
72 71
78 77
89 88
102 101
116 115
119 118
132 131
133 131
136 135
137 135
139 138
140 138
141 138
144 143
145 143
146 143
147 143
148 143
165 164
169 168
170 168
173 172
175 174
186 185
188 187
189 187
191 190
192 190
193 190
200 199
201 199
213 212
224 223
228 227
230 2

Unnamed: 0,text
0,Opinion | Elizabeth Warren Is Ready for a Figh...
1,"Elizabeth Warren, in her interview with the Ti..."
2,But a few questions caught her off guard: how ...
3,"Here is a transcript, with annotations in blue..."
4,Kathleen Kingsbury: So we don’t have very much...
6,KK: Should it be against the law for the child...
7,Several members of the Trump family have come ...
8,"You know, when I put together my anti-corrupti..."
9,Senator Warren’s anti-corruption plan includes...
11,Yes. I just think it’s got this — no one has t...
