# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you tell from the link itself whether it points to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to run the code inside `try`, but if it hits an error it jumps straight to `except` instead of crashing.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date column that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like `%B`) that you use for converting strings into dates.

```python
try:
    # your code
    # your code
    # your code
    pass
except Exception:
    # something went wrong: ignore it and move on
    pass
```
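The "if" statement from the first hint might look something like the sketch below. It runs on a short hand-written list of hrefs rather than the live page. The `/pages/School_Board_Agendas` and `mailto:` entries and the helper name `looks_like_minutes` are made up for illustration, and the rule itself (the path starts with `/files/` and ends in `.pdf`) is an assumption about what a minutes link looks like, so check it against the real page.

```python
# Hypothetical hrefs standing in for the links found on the page.
hrefs = [
    "/files/6.4.19_minutes.pdf",
    "/pages/School_Board_Agendas",   # made-up non-minutes link
    "/files/11.20.18.pdf",
    "mailto:someone@example.com",    # made-up non-minutes link
    None,                            # an <a> tag with no href at all
]


def looks_like_minutes(href):
    """Guess whether an href points at a minutes PDF (assumed pattern)."""
    if not href:
        return False
    return href.startswith("/files/") and href.lower().endswith(".pdf")


minutes_links = []
for href in hrefs:
    if looks_like_minutes(href):
        minutes_links.append(href)

print(minutes_links)
```

The same test could be applied later as a pandas filter instead, which is the approach the second hint describes.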

In [1]:
import requests
import re
import csv
import pandas as pd
import numpy as np
from datetime import datetime
from bs4 import BeautifulSoup

In [2]:
url = "http://www.mineral.k12.nv.us/pages/School_Board_Minutes"
raw_html = requests.get(url).content
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(type(soup_doc))

<class 'bs4.BeautifulSoup'>


In [3]:
### TEST VIEWS OF DATA
# raw_html
# print(soup_doc)
# print(soup_doc.prettify())

In [4]:
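### BUILD [date, url] ROWS, SKIPPING BLANK AND 'Meeting' PARAGRAPHS
minutes_list = []
# Hard-coded slice of the page's <p> tags that holds the minutes entries
paras = soup_doc.find_all('p')[19:78]
for para in paras[1:]:
    # Remove non-breaking spaces so blank paragraphs become empty strings
    date = para.text.replace('\xa0', '')
    # Skip blank paragraphs and any paragraph containing 'Meeting'
    if not date or 'Meeting' in date:
        continue
    # Some meetings have no linked file, so fall back to an empty string.
    # Named href so it doesn't overwrite the page url defined earlier.
    link = para.find('a')
    href = link['href'] if link else ''
    minutes_list.append([date, href])
minutes_list.insert(0, ['date', 'url'])
minutes_list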

[['date', 'url'],
 ['June 4, 2019', '/files/6.4.19_minutes.pdf'],
 ['May 28, 2019', '/files/5.28.19_minutes.pdf'],
 ['May 21, 2019 CANCELLED', ''],
 ['May 7, 2019', '/files/5.7.19_minutes.pdf'],
 ['April 23, 2019', '/files/4.23.19_minutes.pdf'],
 ['April 8, 2019', '/files/4.8.19_minutes.pdf'],
 ['March 19, 2019', '/files/3.5.19_minutes.pdf'],
 ['March 5, 2019', '/files/3.5.19.pdf'],
 ['February 26, 2019', '/files/2.26.19_minutes.pdf'],
 ['February 5, 2019', '/files/2.5.19_minutes.pdf'],
 ['January 22, 2019', '/files/January_22_minutes.pdf'],
 ['January 8, 2019', '/files/January_8_minutes.pdf'],
 ['December 20, 2018', '/files/12.20.18_minutes.pdf'],
 ['December 4, 2018', '/files/12.4.18_minutes.pdf'],
 ['November 20, 2018', '/files/11.20.18.pdf'],
 ['November 7, 2018', ''],
 ['October 16, 2018', ''],
 ['September 25, 2018', '/files/9.25.18_minutes.pdf'],
 ['September 13, 2018', '/files/9.13.18_minutes.pdf'],
 ['September 4, 2018', '/files/9.4.18.pdf'],
 ['August 21, 2018', '/files/8.21.
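
The hrefs in this list are relative to the site root, so they need the site address in front of them before they can be saved as full URLs or downloaded. One way to do that is sketched here with the standard library's `urljoin`. The helper name `absolute_url` is just for illustration, and leaving blank hrefs blank is a choice, not a requirement.

```python
from urllib.parse import urljoin

page_url = "http://www.mineral.k12.nv.us/pages/School_Board_Minutes"

# Two sample hrefs copied from the scraped list: one with a file, one without.
hrefs = ["/files/6.4.19_minutes.pdf", ""]


def absolute_url(base, href):
    """Join a relative href onto the page URL; keep blanks blank."""
    return urljoin(base, href) if href else ""


full_urls = [absolute_url(page_url, href) for href in hrefs]
print(full_urls)
```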

In [10]:
df = pd.DataFrame(minutes_list)
df.columns = df.iloc[0]
df = df[1:]
df.head(15)

|    | date | url |
|---:|:-----|:----|
| 1 | June 4, 2019 | /files/6.4.19_minutes.pdf |
| 2 | May 28, 2019 | /files/5.28.19_minutes.pdf |
| 3 | May 21, 2019 CANCELLED | |
| 4 | May 7, 2019 | /files/5.7.19_minutes.pdf |
| 5 | April 23, 2019 | /files/4.23.19_minutes.pdf |
| 6 | April 8, 2019 | /files/4.8.19_minutes.pdf |
| 7 | March 19, 2019 | /files/3.5.19_minutes.pdf |
| 8 | March 5, 2019 | /files/3.5.19.pdf |
| 9 | February 26, 2019 | /files/2.26.19_minutes.pdf |
| 10 | February 5, 2019 | /files/2.5.19_minutes.pdf |


In [11]:
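# Anything after the four-digit year (e.g. "CANCELLED") becomes a lowercase comment
df['comment'] = df.date.str.extract(r'\d{4}(.*)', expand=False).str.strip().str.lower()
# Keep only the text up to and including the year
df['date'] = df.date.str.extract(r'(.*\d{4})', expand=False)
# Strip all whitespace so every date matches a single format string
df['date'] = df['date'].replace(r'\s', '', regex=True)
df['date'] = pd.to_datetime(df['date'], format='%B%d,%Y')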
df

|    | date | url | comment |
|---:|:-----|:----|:--------|
| 1 | 2019-06-04 | /files/6.4.19_minutes.pdf | |
| 2 | 2019-05-28 | /files/5.28.19_minutes.pdf | |
| 3 | 2019-05-21 | | cancelled |
| 4 | 2019-05-07 | /files/5.7.19_minutes.pdf | |
| 5 | 2019-04-23 | /files/4.23.19_minutes.pdf | |
| 6 | 2019-04-08 | /files/4.8.19_minutes.pdf | |
| 7 | 2019-03-19 | /files/3.5.19_minutes.pdf | |
| 8 | 2019-03-05 | /files/3.5.19.pdf | |
| 9 | 2019-02-26 | /files/2.26.19_minutes.pdf | |
| 10 | 2019-02-05 | /files/2.5.19_minutes.pdf | |
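
The assignment still asks for the dates as YYYY-MM-DD strings in a CSV with the date and the URL. A minimal sketch of that last step, using only the standard library and two sample rows copied from the scraped list, could look like this. The helper name `to_iso` is hypothetical, and the CSV is written to an in-memory buffer here so the example runs anywhere; swap the buffer for `open('minutes.csv', 'w', newline='')` to write the real file.

```python
import csv
import io
import re
from datetime import datetime

# Sample rows copied from the scraped list above.
rows = [
    ("June 4, 2019", "/files/6.4.19_minutes.pdf"),
    ("May 21, 2019 CANCELLED", ""),
]


def to_iso(text):
    """Return 'YYYY-MM-DD' for text starting like 'June 4, 2019', else ''."""
    match = re.match(r"([A-Za-z]+ \d{1,2}, \d{4})", text)
    if not match:
        return ""
    try:
        parsed = datetime.strptime(match.group(1), "%B %d, %Y")
    except ValueError:
        # e.g. a misspelled month name: leave the date blank
        return ""
    return parsed.strftime("%Y-%m-%d")


# In-memory stand-in for the minutes.csv file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["date", "url"])
for text, href in rows:
    writer.writerow([to_iso(text), href])

csv_text = buffer.getvalue()
print(csv_text)
```

With the parsed DataFrame from the cell above, the pandas route would be along the lines of `df['date'].dt.strftime('%Y-%m-%d')` followed by `df.to_csv('minutes.csv', index=False)`, using the same format codes.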
