# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you tell from the link itself whether it points to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like in the code block after these hints. Python runs the code inside `try`, and if it hits an error it jumps straight to `except` instead of crashing.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
try:
    # Put the code that might raise an error here
    pass
except Exception:
    # Ignore the error and move on
    pass
```

* **Hint:** You can use `.apply` to download each pdf, or you can use one of a thousand other ways. It'd be good `.apply` practice though!
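
As a minimal sketch of the "if" statement hint, the check below keeps only links that look like minutes files. The function name `is_minutes_link` and the rule it applies (path under `/files/` and ending in `.pdf`) are assumptions based on the URLs that show up in the scraped output, not something the site documents. Note that not every file name contains the word "minutes", so testing for that word alone would miss some files.

```python
from urllib.parse import urlparse


def is_minutes_link(url):
    """Guess whether a URL points at a minutes PDF.

    Assumption: minutes files live under /files/ and end in .pdf.
    """
    if not url:
        return False
    path = urlparse(url).path.lower()
    if path.startswith("/files/") and path.endswith(".pdf"):
        return True
    return False


# Sample links taken from the scraped output, plus one non-minutes link
links = [
    "http://www.mineral.k12.nv.us/files/9.1.20_minutes.pdf",
    "http://www.mineral.k12.nv.us/files/1.21.20.pdf",
    "http://www.mineral.k12.nv.us/",
]

minutes_links = []
for url in links:
    if is_minutes_link(url):
        minutes_links.append(url)
```

The same function could be applied after scraping instead, for example with `df[df['url'].apply(is_minutes_link)]`.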

In [1]:
import requests
import pandas as pd
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.mineral.k12.nv.us/pages/School_Board_Minutes")



In [5]:
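from selenium.webdriver.common.by import By

# The find_element_by_* helpers were deprecated in Selenium 4 and later removed;
# find_element(By..., ...) is the current form
page = driver.find_element(By.ID, 'livesite-page-content-left')

In [39]:
# Drop the first four paragraphs and the last one
minute_page2 = page.find_elements(By.TAG_NAME, 'p')[4:-1]


In [64]:
mins = []
for item in minute_page2:
    date = item.text.strip()
    try:
        url = item.find_element(By.TAG_NAME, 'a').get_attribute('href')
        mins.append({
            'date': date,
            'url': url
        })
    except Exception:
        # Paragraphs without a link raise an error, so skip them
        pass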

In [66]:
df = pd.DataFrame(mins)

In [68]:
df.head()

| | date | url |
|---|---|---|
| 0 | September 1, 2020 | http://www.mineral.k12.nv.us/files/9.1.20_minu... |
| 1 | August 11, 2020 | http://www.mineral.k12.nv.us/files/8.11.20_min... |
| 2 | July 28, 2020 | http://www.mineral.k12.nv.us/files/7.28.20_min... |
| 3 | July 14, 2020 | http://www.mineral.k12.nv.us/files/7.14.20_min... |
| 4 | June 16, 2020 | http://www.mineral.k12.nv.us/files/6.16.20_min... |


In [75]:
df['date'] = df['date'].astype('string')

In [78]:
import datetime as dt

In [85]:
df['the_date'] = pd.to_datetime(df.date, format='%B %d, %Y').dt.strftime("%Y-%m-%d")

In [90]:
df.head()

| | date | url | the_date |
|---|---|---|---|
| 0 | September 1, 2020 | http://www.mineral.k12.nv.us/files/9.1.20_minu... | 2020-09-01 |
| 1 | August 11, 2020 | http://www.mineral.k12.nv.us/files/8.11.20_min... | 2020-08-11 |
| 2 | July 28, 2020 | http://www.mineral.k12.nv.us/files/7.28.20_min... | 2020-07-28 |
| 3 | July 14, 2020 | http://www.mineral.k12.nv.us/files/7.14.20_min... | 2020-07-14 |
| 4 | June 16, 2020 | http://www.mineral.k12.nv.us/files/6.16.20_min... | 2020-06-16 |
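
To see what the format codes are doing outside of pandas, here is a small sketch of the same conversion using Python's standard `datetime` module. The helper name `to_iso_date` is made up for illustration, and `%B` assumes English month names (the default locale on most setups).

```python
from datetime import datetime


def to_iso_date(text):
    """Turn a string like 'September 1, 2020' into '2020-09-01'."""
    # %B = full month name, %d = day of month, %Y = four-digit year
    parsed = datetime.strptime(text.strip(), "%B %d, %Y")
    # The same style of codes formats the date back into a string
    return parsed.strftime("%Y-%m-%d")


print(to_iso_date("September 1, 2020"))  # 2020-09-01
```

`pd.to_datetime(..., format=...)` plays the role of `strptime` for a whole column, and `.dt.strftime(...)` plays the role of `strftime`.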


In [91]:
del df['date']

In [92]:
df.head()

| | url | the_date |
|---|---|---|
| 0 | http://www.mineral.k12.nv.us/files/9.1.20_minu... | 2020-09-01 |
| 1 | http://www.mineral.k12.nv.us/files/8.11.20_min... | 2020-08-11 |
| 2 | http://www.mineral.k12.nv.us/files/7.28.20_min... | 2020-07-28 |
| 3 | http://www.mineral.k12.nv.us/files/7.14.20_min... | 2020-07-14 |
| 4 | http://www.mineral.k12.nv.us/files/6.16.20_min... | 2020-06-16 |


In [107]:
df.to_csv('minutes.csv', index=False)

### Getting PDFs

In [99]:
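# With axis=1, .apply calls this function once per row and passes that row in,
# so the parameter is named `row` rather than `df` to avoid shadowing the DataFrame
def scrape_pdf(row):
    print(row['url'])
    print('--')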

In [100]:
df.apply(scrape_pdf, axis=1)

http://www.mineral.k12.nv.us/files/9.1.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/8.11.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/7.28.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/7.14.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/6.16.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/5.20.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/4.7.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/3.12.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/3.5.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/2.21.20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/2-4-20_minutes.pdf
--
http://www.mineral.k12.nv.us/files/1.21.20.pdf
--
http://www.mineral.k12.nv.us/files/1.7.20_pdf.pdf
--
http://www.mineral.k12.nv.us/files/12.16.19_minutes.pdf
--
http://www.mineral.k12.nv.us/files/12.3.19_minutes.pdf
--
http://www.mineral.k12.nv.us/files/11.19.19_minutes.pdf
--
http://www.mineral.k12.nv.us/files/11.5.19_minutes.pdf
--
http://www.mineral.k12.nv.us/

0     None
1     None
2     None
...
58    None
dtype: object

In [None]:
# I'm not sure how to use Selenium to download a PDF and reference it in the DataFrame
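
One possible way to finish this step does not need Selenium at all: once the URL is known, a plain HTTP request can fetch the file. The sketch below uses the standard library's `urllib.request` rather than the `requests` import from earlier (`requests.get(url).content` would return the same bytes). The names `pdf_filename` and `download_pdf`, the `pdfs` output folder, and the `fetch` parameter are all hypothetical choices made for illustration; `fetch` exists only so the function can be tried without hitting the network.

```python
import os
import urllib.request
from urllib.parse import urlparse


def pdf_filename(url):
    """Return the last path segment of a URL, e.g. '9.1.20_minutes.pdf'."""
    return os.path.basename(urlparse(url).path)


def download_pdf(row, out_dir="pdfs", fetch=None):
    """Save the PDF behind row['url'] into out_dir and return the local path.

    Returns None for links that do not end in .pdf.
    `fetch` is a hypothetical hook: a function that takes a URL and returns bytes.
    """
    url = row["url"]
    name = pdf_filename(url)
    if not name.lower().endswith(".pdf"):
        return None  # skip links that are not PDF files
    if fetch is None:
        # Default: download over HTTP with the standard library
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as response:
                return response.read()
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "wb") as f:
        f.write(fetch(url))
    return path
```

With the DataFrame from above, `df['local_path'] = df.apply(download_pdf, axis=1)` would download each file and record where it was saved, which also gives the `.apply` practice the hint suggests.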