# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you look at the link to know whether it links or minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to do the stuff inside of 'try', but if it hits an error it will skip right out.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
try:
  blah blah your code
  your code
  your code
except:
  pass
```

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
import pandas as pd

In [3]:
from datetime import datetime as dt

In [4]:
url = 'http://www.mineral.k12.nv.us/pages/School_Board_Minutes'
response = requests.get(url, verify=False)
doc = BeautifulSoup(response.text)

In [5]:
meetings = doc.find_all('a')[22:65]

date_list = []
for date in meetings:
    if date == 'style="background-color: #ffffff;' or 'style=color: #000000;"':
        if date != '\n':
            date_list.append(date.text)

In [6]:
df = pd.DataFrame()
df['Date'] = date_list

In [7]:
df['Date'] = df.Date.replace("\xa0", "")

In [8]:
df['Date']

0            June 4, 2019
1            May 28, 2019
2             May 7, 2019
3          April 23, 2019
4           April 8, 2019
5          March 19, 2019
6           March 5, 2019
7       February 26, 2019
8        February 5, 2019
9        January 22, 2019
10        January 8, 2019
11      December 20, 2018
12       December 4, 2018
13      November 20, 2018
14     September 25, 2018
15    September 13, 2018 
16     September 4, 2018 
17       August 21, 2018 
18        August 7, 2018 
19          July 24, 2018
20        July 10, 2018  
21         June 28, 2018 
22         June 22, 2018 
23         June 21, 2018 
24         June 19, 2108 
25          May 29, 2018 
26         April 17, 2018
27         April 2, 2018 
28          March 8, 2018
29          March 6, 2018
30     February 20, 2018 
31      February 6, 2018 
32       January 16, 2018
33                       
34        January 5, 2017
35       January 26, 2017
36       February 2, 2017
37      February 16, 2017
38          

In [9]:
df['Date'] = pd.to_datetime(df['Date'], format="%B %d, %Y", errors='coerce')

In [10]:
pdf_list = []
for pdf in meetings:
    pdf_list.append(pdf.attrs['href'])

In [11]:
data = list(zip(df['Date'], pdf_list))
str(data)

"[(Timestamp('2019-06-04 00:00:00'), '/files/6.4.19_minutes.pdf'), (Timestamp('2019-05-28 00:00:00'), '/files/5.28.19_minutes.pdf'), (Timestamp('2019-05-07 00:00:00'), '/files/5.7.19_minutes.pdf'), (Timestamp('2019-04-23 00:00:00'), '/files/4.23.19_minutes.pdf'), (Timestamp('2019-04-08 00:00:00'), '/files/4.8.19_minutes.pdf'), (Timestamp('2019-03-19 00:00:00'), '/files/3.5.19_minutes.pdf'), (Timestamp('2019-03-05 00:00:00'), '/files/3.5.19.pdf'), (Timestamp('2019-02-26 00:00:00'), '/files/2.26.19_minutes.pdf'), (Timestamp('2019-02-05 00:00:00'), '/files/2.5.19_minutes.pdf'), (Timestamp('2019-01-22 00:00:00'), '/files/January_22_minutes.pdf'), (Timestamp('2019-01-08 00:00:00'), '/files/January_8_minutes.pdf'), (Timestamp('2018-12-20 00:00:00'), '/files/12.20.18_minutes.pdf'), (Timestamp('2018-12-04 00:00:00'), '/files/12.4.18_minutes.pdf'), (Timestamp('2018-11-20 00:00:00'), '/files/11.20.18.pdf'), (Timestamp('2018-09-25 00:00:00'), '/files/9.25.18_minutes.pdf'), (NaT, '/files/9.13.18_m

In [12]:
pd.DataFrame(data, columns=['Date','PDF'])

Unnamed: 0,Date,PDF
0,2019-06-04,/files/6.4.19_minutes.pdf
1,2019-05-28,/files/5.28.19_minutes.pdf
2,2019-05-07,/files/5.7.19_minutes.pdf
3,2019-04-23,/files/4.23.19_minutes.pdf
4,2019-04-08,/files/4.8.19_minutes.pdf
5,2019-03-19,/files/3.5.19_minutes.pdf
6,2019-03-05,/files/3.5.19.pdf
7,2019-02-26,/files/2.26.19_minutes.pdf
8,2019-02-05,/files/2.5.19_minutes.pdf
9,2019-01-22,/files/January_22_minutes.pdf


In [14]:
df

Unnamed: 0,Date
0,2019-06-04
1,2019-05-28
2,2019-05-07
3,2019-04-23
4,2019-04-08
5,2019-03-19
6,2019-03-05
7,2019-02-26
8,2019-02-05
9,2019-01-22


In [13]:
df.to_csv('school_board_minutes', index=False)