# Pull California Street Names
This notebook uses BeautifulSoup to scrape street names from [Geographic.org](https://geographic.org/streetview/usa/ca/). The data could be used to train a spaCy NLP model, but we did not get to that step in this project.

In [72]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

In [78]:
url = 'https://geographic.org/streetview/usa/ca/'
tag = 'index2.html'

In [79]:
res = requests.get(url + tag)

In [None]:
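# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(res.content, 'html.parser')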

## Webscrape
Loop through each county and city to scrape the county, city, street name and zip code for each street.
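
The loop derives each page URL from the `alt` text of the anchor it just read. The sketch below isolates that rule so it can be checked without hitting the site. The helper names `county_page_url` and `city_page_url` are hypothetical, the URL layout is inferred from the notebook code rather than verified against the site, and "Hollister" is only an illustrative city name.

```python
BASE_URL = 'https://geographic.org/streetview/usa/ca/'


def county_page_url(base_url, county_alt):
    """Build a county index URL from the anchor's alt text, as the loop does."""
    county_tag = county_alt.lower().replace(' ', '_')
    return base_url + county_tag + '/index.html'


def city_page_url(county_url, city_alt):
    """Swap 'index' for the lower-cased city name, as the loop does."""
    # str.replace swaps every 'index' substring, so this only works while
    # 'index' appears nowhere else in the county URL.
    return county_url.replace('index', city_alt.lower())


county_url = county_page_url(BASE_URL, 'San Benito')
city_url = city_page_url(county_url, 'Hollister')
print(county_url)
print(city_url)
```

Under this rule a multi-word city name keeps its spaces in the URL, so it is worth spot-checking a few such cities before a full run.
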

In [122]:
t_0 = time.time()

# `roads` is not reset here: this run resumes at county index 30 (Placer)
# and appends to the list built by earlier runs.
counties_ul = soup.find('ul')
counties = counties_ul.find_all('li')

for i, county in enumerate(counties[30:]):
    county_anchor = county.find('a')
    # The county folder in the URL is the alt text, lower-cased with underscores
    county_tag = county_anchor.attrs['alt'].lower().replace(' ', '_')
    county_name = county_anchor.text
    county_url = url + county_tag + '/index.html'
    county_res = requests.get(county_url)
    time.sleep(2)  # pause between requests to avoid hammering the site
    county_soup = BeautifulSoup(county_res.content, 'html.parser')
    cities = county_soup.find('ul')
    for city in cities.find_all('li'):
        city_anchor = city.find('a')
        city_tag = city_anchor.attrs['href']
        city_name = city_anchor.attrs['alt']
        # The city page replaces 'index' in the county URL with the city name
        city_url = county_url.replace('index', city_name.lower())
        city_res = requests.get(city_url)
        time.sleep(2)
        city_soup = BeautifulSoup(city_res.content, 'html.parser')
        streets = city_soup.find('ul')
        for street in streets.find_all('li'):
            roads.append({
                'county': county_name,
                'city': city_name,
                'street': street.find('a').attrs['alt'],
                # Assumes each list item's text ends with a five-digit zip code
                'zip_code': street.text[-5:],
            })
    # Progress is measured against all counties, so a resumed run starts at 0
    # and never reaches 1.0.
    print(f"[{'#' * i}{' ' * (len(counties) - i)}] {county_name} {i/len(counties)}")
print(time.time() - t_0)
pd.DataFrame(roads)

[                                                          ] Placer 0.0
[#                                                         ] Plumas 0.017241379310344827
[##                                                        ] Riverside 0.034482758620689655
[###                                                       ] Sacramento 0.05172413793103448
[####                                                      ] San Benito 0.06896551724137931
[#####                                                     ] San Bernardino 0.08620689655172414
[######                                                    ] San Diego 0.10344827586206896
[#######                                                   ] San Francisco 0.1206896551724138
[########                                                  ] San Joaquin 0.13793103448275862
[#########                                                 ] San Luis Obispo 0.15517241379310345
[##########                                                ] San Mateo 0.1724137931034483
[#

| Unnamed: 0 | county | city | street | zip_code |
|---|---|---|---|---|
| 0 | Alameda | Alameda | 1st Street | 94501 |
| 1 | Alameda | Alameda | 2nd Street | 94501 |
| 2 | Alameda | Alameda | 3rd Street | 94501 |
| 3 | Alameda | Alameda | 4th Street | 94501 |
| 4 | Alameda | Alameda | 5th Street | 94501 |
| ... | ... | ... | ... | ... |
| 268855 | Yuba | Wheatland | Wheatland Road | 95692 |
| 268856 | Yuba | Wheatland | Wichita Way | 95692 |
| 268857 | Yuba | Wheatland | Wintun Way | 95692 |
| 268858 | Yuba | Wheatland | Witchita Way | 95692 |
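
The scraping loop takes each zip code as the last five characters of the list item's text (`street.text[-5:]`), which silently returns the wrong value if an item has trailing whitespace or no zip code at all. A more defensive variant, sketched here with a hypothetical helper `extract_zip`, matches five digits at the end of the text and returns `None` otherwise. The sample strings are assumptions about what a list item's text might look like, not captured from the site.

```python
import re

# Five digits at the very end of the text, ignoring trailing whitespace
ZIP_AT_END = re.compile(r'(\d{5})\s*$')


def extract_zip(item_text):
    """Return the trailing five-digit zip code, or None if there is none."""
    match = ZIP_AT_END.search(item_text)
    return match.group(1) if match else None


samples = ['Wheatland Road 95692', 'Wichita Way 95692 \n', 'Unnamed Road']
zips = [extract_zip(text) for text in samples]
print(zips)
```

Rows where the helper returns `None` can then be logged or dropped instead of being stored with a bogus zip code.
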


In [125]:
pd.DataFrame(roads).to_csv('../data/california_street_names.csv', index=False)

# Summary
We scraped 268,859 California street records (county, city, street name, and zip code) from [Geographic.org](https://geographic.org/streetview/usa/ca/) and saved them to `../data/california_street_names.csv`.
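
As a possible starting point for the spaCy use mentioned at the top, the street names could be converted into token patterns in the dictionary format that spaCy's `EntityRuler` accepts. This is a minimal sketch, not something we ran in this project. The helper `street_patterns` and the `STREET` label are hypothetical, and splitting on whitespace is an assumption that may not match spaCy's own tokenization for names containing punctuation.

```python
def street_patterns(street_names, label='STREET'):
    """Turn street names into token patterns in spaCy's EntityRuler format."""
    patterns = []
    # De-duplicate and sort so the output is stable across runs
    for name in sorted(set(street_names)):
        tokens = [{'LOWER': word.lower()} for word in name.split()]
        if tokens:
            patterns.append({'label': label, 'pattern': tokens})
    return patterns


patterns = street_patterns(['1st Street', 'Wichita Way', '1st Street'])
print(patterns)

# Assumed next step with spaCy v3 installed (not run here):
#   ruler = nlp.add_pipe('entity_ruler')
#   ruler.add_patterns(patterns)
```

Matching on `LOWER` makes the patterns case-insensitive, which suits free text where street names are not consistently capitalized.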