# Texas Tow Trucks

We're going to scrape some [tow trucks in Texas](https://www.tdlr.texas.gov/tools_search/).

## Import your imports

In [19]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

import pandas as pd
import time
import re

In [20]:
driver = webdriver.Chrome()

In [21]:
driver.get('https://www.tdlr.texas.gov/tools_search/')

## Search for the TDLR Number `006179570C`, and scrape the information on that company

Using the [license information system](https://www.tdlr.texas.gov/tools_search/), find information about the tow truck number above, displaying:

- The business name
- Owner/operator
- Phone number
- License status (Active, Expired, Etc)
- Physical address

If you can't figure out a 'nice' way to locate something, your last few options might be:

- **Find a "parent" element, then dig inside**
- **Find all of a type of element** (like we did with `td` before) and get the `[0]`, `[1]`, `[2]`, etc
- **XPath** (inspect an element, Copy > Copy XPath)

These kinds of techniques tend to break when you're on other result pages, but... maybe not! You won't know until you try.

> - *TIP: An XPath copied from the browser usually contains double quotes (like `@id="t1"`), so wrap it in single quotes in Python. Otherwise the string ends early and Python will get confused.*
> - *TIP: You can clean your data up if you want to, or leave it dirty to clean later*
> - *TIP: The address part can be tough, but you have a few options. You can use a combination of `.split` and list slicing to clean it now, or clean it later in the dataframe with regular expressions. Or other options, too, probably*

In [22]:
text_input = driver.find_element_by_name('mcrdata') #to write the plate number into the field

In [23]:
driver.execute_script("arguments[0].scrollIntoView(true)", text_input) #to scroll down
text_input.send_keys("006179570C")

In [24]:
search_button = driver.find_element_by_name('proc') #find the search button and click it
search_button.click()

In [25]:
cells = driver.find_elements_by_tag_name('td')

name = cells[5].text
#name
owner_officer = cells[7].text
#owner_officer
phone_number = cells[9].text
#phone_number
license_status = cells[12].text
#license_status
address = cells[14].text #the address block sits in cell 14, not cell 12
address


'Carrier Type:  Tow Truck Company\nNumber of Active Tow Trucks:   0\n\nAddress Information\nMailing:\n13619 BRETT JACKSON RD\nFORT WORTH, TX. 76179\n\nPhysical:\n13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179'

In [26]:
whole_info = cells[14].text
whole_info

'Carrier Type:  Tow Truck Company\nNumber of Active Tow Trucks:   0\n\nAddress Information\nMailing:\n13619 BRETT JACKSON RD\nFORT WORTH, TX. 76179\n\nPhysical:\n13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179'

In [27]:
whole_info.split(':')[-1]

'\n13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179'

In [28]:
physical_address = whole_info.split(':\n')[-1] #split on ':\n' and keep the last piece, so the leading \n is gone
physical_address

'13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179'
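The `.split(':\n')[-1]` trick works because the physical address happens to come last in the cell. As a rough alternative, here is a sketch that looks up an address by its label with a regular expression. The helper name `extract_address` is made up for illustration, and the sample string is the cell text copied from the output above; other result pages may be laid out differently.

```python
import re

# Cell text copied from the notebook output above.
whole_info = (
    "Carrier Type:  Tow Truck Company\n"
    "Number of Active Tow Trucks:   0\n\n"
    "Address Information\n"
    "Mailing:\n13619 BRETT JACKSON RD\nFORT WORTH, TX. 76179\n\n"
    "Physical:\n13619 BRETT JACKSON RD.\nFORT WORTH, TX. 76179"
)


def extract_address(text, label="Physical"):
    """Return the lines after `label:` up to the next blank line, joined by spaces.

    Hypothetical helper: returns None when the label is not found.
    """
    pattern = rf"{re.escape(label)}:\n(.+?)(?:\n\n|$)"
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse the multi-line address into a single line.
    return " ".join(match.group(1).splitlines())


print(extract_address(whole_info))
print(extract_address(whole_info, "Mailing"))
```

Looking up the label means the result does not depend on the physical address being the last item in the cell.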

# Adapt this to work inside of a single cell

Double-check that it works. You want it to print out all of the details.

In [80]:
#1.) Open Chrome and send the driver to that url
driver.get('https://www.tdlr.texas.gov/tools_search/')

#2.) Search for field and number
driver.find_element_by_id('mcrbutton').click() #find the button and click it
driver.find_element_by_id('mcrdata').send_keys('006179570C') #send the number to the TDLR field
button = driver.find_element_by_id('submit3') #locate the search button
driver.execute_script("arguments[0].scrollIntoView(true)", button) #scroll to the search button
button.click()

#3.) Scraping the details of the results page: I'll try another way because I think the way I did it was very messy.
time.sleep(1)
#I use the xpath of the whole info_block!
company_info = re.split(":   ",driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody').text)
#print(company_info) to double-check that it works; it shows a list of different strings
print(company_info[1]) #companyname
business = company_info[1]
print(re.split(" /", company_info[3])[0]) #owner
owner = re.split(" /", company_info[3])[0]
print(company_info[4]) #phone
phone = company_info[4]

#Now I use the xpath of the whole table 3
company_details = driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]').text
#print(company_details)  #Again to make sure if it works
print(re.findall(r"\((.*)\)", company_details)[0]) #status (Note: re.findall returns a list of strings; [0] gives me the first string of that list)

status = re.findall(r"\((.*)\)", company_details)[0] #raw string so Python doesn't complain about the \( escape

address = re.split(":\n", company_details)[-1] #split on ":\n" and keep the last piece (the physical address)
address = address.replace("\n", " ") #replace the remaining \n with a space
print(address)

B.D. SMITH TOWING DBA
BRANDT SMITH
8173330706
Expired
13619 BRETT JACKSON RD. FORT WORTH, TX. 76179


# Using .apply to find data about SEVERAL tow truck companies

The file `trucks-subset.csv` has information about the trucks; we'll use it to find the pages to scrape.

### Open up `trucks-subset.csv` and save it into a dataframe

In [81]:
df = pd.read_csv('trucks-subset.csv')
df.head()

Unnamed: 0,TDLR Number
0,006507931C
1,006179570C
2,006502097C


## Go through each row of the dataset, displaying the URL you will need to scrape for the information on that row

You don't have to actually use the search form for each of these - look at the URL you're on, it has the number in it!

For example, one URL might look like `https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006495492C`.

- *TIP: Use .apply and a function*
- *TIP: Unlike the Yelp example, you'll need to build this URL from pieces*
- *TIP: You probably don't want to `print` unless you're going to fix it for the next question*
- *TIP: pandas won't show you the entire URL! Run `pd.set_option('display.max_colwidth', -1)` (use `None` instead of `-1` on pandas 1.0 and later) to display aaaalll of the text in a cell*
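As a small sketch of "building the URL from pieces": the function below takes anything that can be indexed by column name (a dict here, a dataframe row when used with `.apply(..., axis=1)`) and assembles the URL with `urllib.parse.urlencode`. The names `BASE_URL` and `build_url` are invented for this example; the `mcrnumber` parameter comes from the example URL above.

```python
from urllib.parse import urlencode

# Base of the results page, taken from the example URL above.
BASE_URL = "https://www.tdlr.texas.gov/tools_search/mccs_display.asp"


def build_url(row):
    """Build the results-page URL for one row (hypothetical helper)."""
    query = urlencode({"mcrnumber": row["TDLR Number"]})
    return f"{BASE_URL}?{query}"


# A plain dict stands in for a dataframe row here.
print(build_url({"TDLR Number": "006495492C"}))
```

Plain string concatenation works just as well for these numbers; `urlencode` only starts to matter if a value ever contains spaces or other special characters.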

In [82]:
df['TDLR Number']

0    006507931C
1    006179570C
2    006502097C
Name: TDLR Number, dtype: object

In [188]:
#driver = webdriver.Chrome()
#driver.get("https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006507931C")

In [192]:
def get_towtruck(row):
    #navigate the driver to the url
    #driver.get('https://www.tdlr.texas.gov/tools_search/')
    # fill out the form and search
    #driver.find_element_by_id('mcrbutton').click() #find the button and click it
    #driver.find_element_by_id('mcrdata').send_keys(row['TDLR Number']) #send the license number to the TDLR field
    #button = driver.find_element_by_id('submit3') #locate the search button
    #driver.execute_script("arguments[0].scrollIntoView(true)", button) #scroll to the search button
    #button.click()
    
    url = 'https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber='+row['TDLR Number']
    return url

In [193]:
df.apply(get_towtruck, axis=1)

0    https://www.tdlr.texas.gov/tools_search/mccs_d...
1    https://www.tdlr.texas.gov/tools_search/mccs_d...
2    https://www.tdlr.texas.gov/tools_search/mccs_d...
dtype: object

### Save this URL into a new column of your dataframe, called `url`

- *TIP: Use a function and `.apply`*
- *TIP: Be sure to use `return`*

In [35]:
def get_towtruck(row):
    #send the driver to the url
    driver.get('https://www.tdlr.texas.gov/tools_search/')
    # fill out the field and search
    driver.find_element_by_id('mcrbutton').click() #find the button and click it
    driver.find_element_by_id('mcrdata').send_keys(row['TDLR Number']) #send the number to the TDLR field
    button = driver.find_element_by_id('submit3') #locate the search button
    driver.execute_script("arguments[0].scrollIntoView(true)", button) #scroll to the search button
    button.click()
    
    url = 'https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber='+row['TDLR Number']
    
    return pd.Series({ #url is gonna be saved into a column of my pandas dataframe
            'url': url
            
        })

In [38]:
# to make a new DataFrame with the old and new stuff together:
url_df = df.apply(get_towtruck, axis=1).join(df)
url_df

Unnamed: 0,url,TDLR Number
0,https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006507931C,006507931C
1,https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006179570C,006179570C
2,https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006502097C,006502097C


## Go through each row of the dataset, printing out information about each tow truck company.

Now you will be **scraping** inside of your function.

- The business name
- Owner/operator
- Phone number
- License status (Active, Expired, Etc)
- Physical address

Just print it out for now.

- *TIP: use .apply*
- *TIP: You'll be using the code you wrote before, but converted into a function*
- *TIP: Remember how the TDLR Number is in the URL? You don't need to do the form submission if you don't want!*
- *TIP: Make sure you adjust any variables so you don't scrape the same page again and again*

In [103]:
def get_towtruck(row):
    ###################
    #Part 1: Submission 
    ###################
    #Navigate the driver to the url
    driver.get('https://www.tdlr.texas.gov/tools_search/')
    # fill out the form and search
    driver.find_element_by_id('mcrbutton').click() #find the button and click it
    driver.find_element_by_id('mcrdata').send_keys(row['TDLR Number']) #send the number to the TDLR field
    button = driver.find_element_by_id('submit3') #locate the search button
    driver.execute_script("arguments[0].scrollIntoView(true)", button) #scroll to the search button
    button.click()
    
    ##############################################
    #Part 2: Get the details from the results page
    ##############################################
    company_info = re.split(":   ",driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody').text)
    print(company_info[1]) #companyname
    business = company_info[1]
    print(re.split(" /", company_info[3])[0]) #owner
    owner = re.split(" /", company_info[3])[0]
    print(company_info[4]) #phone
    phone = company_info[4]
    
    company_details = driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]').text
    print(re.findall(r"\((.*)\)", company_details)[0]) #status
    status = re.findall(r"\((.*)\)", company_details)[0]
    
    address = re.split(":\n", company_details)[-1]
    print(str.replace(address, "\n", " "))
    address = str.replace(address, "\n", " ")

In [43]:
url_df.apply(get_towtruck, axis=1)

AUGUSTUS E SMITH DBA
AUGUSTUS EUGENE SMITH
9032276464
Active
103 N MAIN ST BONHAM, TX. 75418
B.D. SMITH TOWING DBA
BRANDT SMITH
8173330706
Expired
13619 BRETT JACKSON RD. FORT WORTH, TX. 76179
BARRY MICHAEL SMITH DBA
BARRY MICHAEL SMITH
8066544404
Active
4501 W CEMETERY RD CANYON, TX. 79015


0    None
1    None
2    None
dtype: object

## Scrape the following information for each row of the dataset, and save it into new columns in your dataframe.

- The business name
- Owner/operator
- Phone number
- License status (Active, Expired, Etc)
- Physical address

It's basically what we did before, but using the function a little differently.

- *TIP: Same as above, but you'll be returning a `pd.Series` and the `.apply` line is going to be a lot longer*
- *TIP: Save it to a new dataframe!*
- *TIP: Make sure you change your `df` variable names correctly if you're cutting and pasting - there are a few so it can get tricky*

In [102]:
def get_towtruck(row):
    #navigate the driver to the url
    driver.get('https://www.tdlr.texas.gov/tools_search/')
    
    # fill out the form and search
    driver.find_element_by_id('mcrbutton').click() #find the button and click it
    driver.find_element_by_id('mcrdata').send_keys(row['TDLR Number']) #send the license number to the TDLR field
    button = driver.find_element_by_id('submit3') #locate the search button
    driver.execute_script("arguments[0].scrollIntoView(true)", button) #scroll to the search button
    button.click()
    
    #get the details from the result page
    time.sleep(1)
    company_info = re.split(":   ",driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody').text)
    print(company_info[1]) #companyname
    business = company_info[1]
    print(re.split(" /", company_info[3])[0]) #owner
    owner = re.split(" /", company_info[3])[0]
    print(company_info[4]) #phone
    phone = company_info[4]
    
    company_details = driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]').text
    print(re.findall(r"\((.*)\)", company_details)[0]) #status
    status = re.findall(r"\((.*)\)", company_details)[0]
    
    address = re.split(":\n", company_details)[-1]
    print(str.replace(address, "\n", " "))
    address = str.replace(address, "\n", " ")
    
    return pd.Series({
        'Business Name': business,
        'Owner': owner,
        'Phone number': phone,
        'License status': status,
        'Physical address': address
        
    })

In [91]:
new_df = url_df.apply(get_towtruck, axis=1).join(url_df)
new_df

AUGUSTUS E SMITH DBA
AUGUSTUS EUGENE SMITH
9032276464
Active
103 N MAIN ST BONHAM, TX. 75418
AUGUSTUS E SMITH DBA
AUGUSTUS EUGENE SMITH
9032276464
Active
103 N MAIN ST BONHAM, TX. 75418
B.D. SMITH TOWING DBA
BRANDT SMITH
8173330706
Expired
13619 BRETT JACKSON RD. FORT WORTH, TX. 76179
BARRY MICHAEL SMITH DBA
BARRY MICHAEL SMITH
8066544404
Active
4501 W CEMETERY RD CANYON, TX. 79015


Unnamed: 0,Business Name,Owner,Phone number,License status,Physical address,url,TDLR Number
0,AUGUSTUS E SMITH DBA,AUGUSTUS EUGENE SMITH,9032276464,Active,"103 N MAIN ST BONHAM, TX. 75418",https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006507931C,006507931C
1,B.D. SMITH TOWING DBA,BRANDT SMITH,8173330706,Expired,"13619 BRETT JACKSON RD. FORT WORTH, TX. 76179",https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006179570C,006179570C
2,BARRY MICHAEL SMITH DBA,BARRY MICHAEL SMITH,8066544404,Active,"4501 W CEMETERY RD CANYON, TX. 79015",https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006502097C,006502097C


### Save your dataframe as a CSV named `tow-trucks-extended.csv`

In [92]:
new_df.to_csv('tow-trucks-extended.csv', index=False)

### Re-open your dataframe to confirm you didn't save any extra weird columns

In [93]:
pd.read_csv('tow-trucks-extended.csv')

Unnamed: 0,Business Name,Owner,Phone number,License status,Physical address,url,TDLR Number
0,AUGUSTUS E SMITH DBA,AUGUSTUS EUGENE SMITH,9032276464,Active,"103 N MAIN ST BONHAM, TX. 75418",https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006507931C,006507931C
1,B.D. SMITH TOWING DBA,BRANDT SMITH,8173330706,Expired,"13619 BRETT JACKSON RD. FORT WORTH, TX. 76179",https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006179570C,006179570C
2,BARRY MICHAEL SMITH DBA,BARRY MICHAEL SMITH,8066544404,Active,"4501 W CEMETERY RD CANYON, TX. 79015",https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006502097C,006502097C


## Process the entire `tow-trucks.csv` file

We just did it on a short subset so far. Now try it on all of the tow trucks. **Save as the same filename as before**

In [94]:
df = pd.read_csv('tow-trucks.csv')
df.head()

Unnamed: 0,TDLR Number
0,006507931C
1,006179570C
2,006502097C
3,006494912C
4,0649468VSF


In [101]:
def get_towtruck(row):
    #navigate the driver to the url
    driver.get('https://www.tdlr.texas.gov/tools_search/')
    
    # fill out the form and search
    driver.find_element_by_id('mcrbutton').click() #find the button and click it
    driver.find_element_by_id('mcrdata').send_keys(row['TDLR Number']) #send the license number to the TDLR field
    print(row['TDLR Number'])
    button = driver.find_element_by_id('submit3') #locate the search button
    driver.execute_script("arguments[0].scrollIntoView(true)", button) #scroll to the search button
    button.click()
    url = 'https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber='+row['TDLR Number'] #to save the url in the df

    try:
        #get the details from the result page
        time.sleep(1)
        company_info = re.split(":   ",driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[2]/tbody').text)
        print(company_info[1]) #companyname
        business = company_info[1]
        print(re.split(" /", company_info[3])[0]) #owner
        owner = re.split(" /", company_info[3])[0]
        print(company_info[4]) #phone
        phone = company_info[4]
        
        company_details = driver.find_element_by_xpath('//*[@id="t1"]/tbody/tr/td/font/table[3]').text
        print(re.findall(r"\((.*)\)", company_details)[0]) #status
        status = re.findall(r"\((.*)\)", company_details)[0]
        
        address = re.split(":\n", company_details)[-1]
        print(str.replace(address, "\n", " "))
        address = str.replace(address, "\n", " ")
        
        return pd.Series({
                'business': business,
                'owner': owner,
                'phone': phone,
                'status': status,
                'address': address,
                'url': url
                
            })
    except Exception: #catch scraping errors, but still allow Ctrl-C to stop the run
        print("no record")
        #so I don't lose the TDLR Number, I write "no record" instead
        return pd.Series({
                'business': 'no record',
                'owner': 'no record',
                'phone': 'no record',
                'status': 'no record',
                'address': 'no record',
                'url': url
                
            })
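The `except` branch repeats every key by hand, which makes it easy to forget one when a column is added later. One possible way to tidy that up is a small helper that builds the placeholder record from a list of field names. Both `FIELDS` and `empty_record` are names invented for this sketch; the result is a plain dict, which could be wrapped as `pd.Series(empty_record(url))` inside the function.

```python
# Column names used by the scraping function above.
FIELDS = ["business", "owner", "phone", "status", "address"]


def empty_record(url, placeholder="no record"):
    """Return a dict with every scraped field set to `placeholder`, plus the URL.

    Hypothetical helper: keeps the fallback row in sync with FIELDS.
    """
    record = {field: placeholder for field in FIELDS}
    record["url"] = url
    return record


print(empty_record("https://www.tdlr.texas.gov/tools_search/mccs_display.asp?mcrnumber=006494912C"))
```

Passing `placeholder=None` instead would leave those cells empty in the dataframe, which can make the failed lookups easier to filter out later.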

In [100]:
all_df = df.apply(get_towtruck, axis=1).join(df)

006507931C
AUGUSTUS E SMITH DBA
AUGUSTUS EUGENE SMITH
9032276464
Active
103 N MAIN ST BONHAM, TX. 75418
006507931C
AUGUSTUS E SMITH DBA
AUGUSTUS EUGENE SMITH
9032276464
Active
103 N MAIN ST BONHAM, TX. 75418
006179570C
B.D. SMITH TOWING DBA
BRANDT SMITH
8173330706
Expired
13619 BRETT JACKSON RD. FORT WORTH, TX. 76179
006502097C
BARRY MICHAEL SMITH DBA
BARRY MICHAEL SMITH
8066544404
Active
4501 W CEMETERY RD CANYON, TX. 79015
006494912C
no record
0649468VSF
no record
006448786C
HYSMITH AUTOMOTIVE DBA
WILLIAM THOMAS HYSMITH
ASHLEY ERIN HYSMITH / TREASURER
Phone
Active
1210 US 380 BYPASS GRAHAM, TX. 76450
0648444VSF
no record
0651667VSF
HYSMITH AUTOMOTIVE & TRUCK REPAIR INC DBA
WILLIAM THOMAS HYSMITH
ASHLEY ERIN HYSMITH / TREASURER
Phone
Active
1210 380 BYPASS GRAHAM, TX. 76450
006017767C
no record
006495492C
JEFF SMITH DBA
JEFFREY JOHN SMITH
8324354670
Active
4338 HARVEY RD CROSBY, TX. 77532
006518521C
LUTHER SMITH DBA
LUTHER EUGENE SMITH
281-838-9435
Insurance and/or fees not applied
20

In [None]:
#I have a few "no records" there, but no clue how to get rid of those or get the information...
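A related observation from the output above: for the two HYSMITH rows the phone slot printed `Phone` instead of a number, which suggests that picking pieces by position (`company_info[3]`, `company_info[4]`) shifts when a company lists more than one officer. Below is a rough sketch of looking values up by their label instead. The sample text is a guess at what the block looks like (the `Owner/Officer` label, the `PRESIDENT` title, and the layout are assumptions, and the phone number is a placeholder), and `find_phone` and `find_officers` are hypothetical helpers, so treat this as a starting point rather than a drop-in fix.

```python
import re

# Hypothetical block text; the real page layout may differ.
sample = (
    "Name:   HYSMITH AUTOMOTIVE DBA:   HYSMITH AUTOMOTIVE\n"
    "Owner/Officer:   WILLIAM THOMAS HYSMITH / PRESIDENT\n"
    "Owner/Officer:   ASHLEY ERIN HYSMITH / TREASURER\n"
    "Phone:   0000000000"  # placeholder, not a real number
)


def find_phone(block_text):
    """Return the digits (and dashes) after 'Phone:', or None if missing."""
    match = re.search(r"Phone:\s*([\d-]+)", block_text)
    return match.group(1) if match else None


def find_officers(block_text):
    """Return every name that sits between 'Owner/Officer:' and the next '/'."""
    names = re.findall(r"Owner/Officer:\s*(.+?)\s*/", block_text)
    return [name.strip() for name in names]


print(find_phone(sample))
print(find_officers(sample))
```

As for the "no record" rows, it may be worth opening one of those URLs by hand first to see whether the page is laid out differently or the number genuinely has no entry.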