# Text Mining the Web
- @author Danny Lam
- created on May 28, 2020

#### For this project, I will use BeautifulSoup to scrape data from Craigslist.org.
- I will search for BMWs for sale in the Los Angeles area.
- I set a minimum price on the search because many people create listings priced at 0 dollars.
- After exporting my data to an Excel file, I will analyze, visualize, and draw conclusions in Tableau.

- One challenge I ran into while working on this project was learning how to use Python to properly retrieve the desired data from Craigslist.
- Overall, I had a lot of fun working on this project!

In [1]:
# import libraries
from bs4 import BeautifulSoup #parses the html content of the webpage
import requests #sends the http request that fetches the webpage

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
# create empty dictionary to hold my craigslist search entries
bmw_dict = {}

#set counter to 0 for dictionary to later correspond to the index
bmw_count = 0

#specify url
url = 'https://losangeles.craigslist.org/search/cto?sort=date&min_price=3000&auto_make_model=bmw&min_auto_miles=1&condition=10&condition=20&condition=30&condition=40&condition=50&condition=60&auto_paint=1&auto_paint=2&auto_paint=20&auto_paint=3&auto_paint=4&auto_paint=5&auto_paint=6&auto_paint=7&auto_paint=8&auto_paint=9&auto_paint=10'
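
The long search URL above is easier to read and adjust when it is built from a dictionary of filters. The following is a minimal sketch, assuming the parameter names and values are exactly the ones already present in that URL; `BASE_URL`, `params`, and `search_url` are illustrative names that the original notebook does not use.

```python
from urllib.parse import urlencode

BASE_URL = "https://losangeles.craigslist.org/search/cto"

# Filters copied from the URL above. A list value becomes a repeated key
# (condition=10&condition=20&...) because doseq=True is passed to urlencode.
params = {
    "sort": "date",
    "min_price": 3000,
    "auto_make_model": "bmw",
    "min_auto_miles": 1,
    "condition": [10, 20, 30, 40, 50, 60],
    "auto_paint": [1, 2, 20, 3, 4, 5, 6, 7, 8, 9, 10],
}

search_url = f"{BASE_URL}?{urlencode(params, doseq=True)}"
print(search_url)
```

Changing a filter, such as the minimum price, then means editing one dictionary entry instead of hunting through the query string.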


In [3]:
while True:
    #set user-agent header
    headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
    
    # connect with webpage, taking the data and using BeautifulSoup to parse the html
    response = requests.get(url, headers=headers, timeout=5)
    data = response.text
    # parse the html of url using beautiful soup and store in variable soup
    soup = BeautifulSoup(data,'html.parser')
    
    #find every <p class="result-info"> element (one per listing) and store them in variable bmws
    bmws = soup.find_all('p',{'class':'result-info'})
    
    #for loop to retrieve the data of every listing using BeautifulSoup
    for bmw in bmws:
        #set the attributes to null because there will be missing data on some listings
        condition=np.nan
        mileage=np.nan
        color=np.nan
        status=np.nan
        transmission=np.nan
        type=np.nan
        
        #get location of car listing
        location_tag = bmw.find('span',{'class':'result-hood'})
        #get location without the symbols on craigslist
        location = location_tag.text[2:-1] if location_tag else "N/A"
        
        #get the links of each car listing 
        link = bmw.find('a', {'class': 'result-title'}).get('href')
        
        #once I get the link I will make a new connection (same headers and timeout) and parse with beautiful soup
        bmw_response = requests.get(link, headers=headers, timeout=5)
        bmw_data = bmw_response.text
        bmw_soup = BeautifulSoup(bmw_data, 'html.parser')
        
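        #add title and price of each car listing (either tag can be missing on some listings)
        title_tag = bmw_soup.find('span', {'id':'titletextonly'})
        title = title_tag.text if title_tag else np.nan
        price_tag = bmw_soup.find('span', {'class':'price'})
        #get the price without the '$' symbol
        price = price_tag.text[1:] if price_tag else np.nan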
        
        # collect the attributes of the cars
        att = bmw_soup.find_all('p',{'class':'attrgroup'})
        
        # loop over every attribute group and read each span as text
        for p in att:
            for s in p.find_all('span'):
                #split each attribute at the first ':' and assign the value when the key matches
                key, sep, value = s.text.partition(':')
                if not sep:
                    continue #skip spans that are not 'key: value' pairs
                key = key.strip()
                value = value.strip() #drop the space that follows the ':'
                if key=='condition':
                    condition=value
                elif key=='odometer':
                    mileage=value
                elif key=='paint color':
                    color=value
                elif key=='title status':
                    status=value
                elif key=='transmission':
                    transmission=value
                elif key =='type':
                    type = value
    
        # load up the dictionary with the columns and add 1 to counter
        bmw_dict[bmw_count] = [title, price, location, condition, mileage, color, status, transmission, type]
        bmw_count+=1

        
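    # follow the link to the next page of craigslist results to get more data
    url_tag = soup.find('a',{'title':'next page'})
    #stop when there is no next-page link at all or when it has no href
    if url_tag and url_tag.get('href'):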
        url= 'https://losangeles.craigslist.org' + url_tag.get('href')
        print(url)
    else:
        break

https://losangeles.craigslist.org/search/cto?s=120&auto_make_model=bmw&auto_paint=1&auto_paint=10&auto_paint=2&auto_paint=20&auto_paint=3&auto_paint=4&auto_paint=5&auto_paint=6&auto_paint=7&auto_paint=8&auto_paint=9&condition=10&condition=20&condition=30&condition=40&condition=50&condition=60&min_auto_miles=1&min_price=3000&sort=date
https://losangeles.craigslist.org/search/cto?s=240&auto_make_model=bmw&auto_paint=1&auto_paint=10&auto_paint=2&auto_paint=20&auto_paint=3&auto_paint=4&auto_paint=5&auto_paint=6&auto_paint=7&auto_paint=8&auto_paint=9&condition=10&condition=20&condition=30&condition=40&condition=50&condition=60&min_auto_miles=1&min_price=3000&sort=date
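
The attribute loop in cell 3 relies on each span holding text of the form `key: value`. The same idea can be sketched as a small standalone function, which makes it easy to check on sample input without touching the network. This is only an illustration: `parse_attributes` and the sample strings are hypothetical and are not part of the original notebook.

```python
def parse_attributes(span_texts):
    """Turn 'key: value' strings into a dict, skipping entries without a colon."""
    attributes = {}
    for text in span_texts:
        key, sep, value = text.partition(":")
        if not sep:
            continue  # e.g. a bare vehicle name with no 'key: value' structure
        attributes[key.strip()] = value.strip()
    return attributes


# Hypothetical span texts, shaped like the values in the results table.
sample = ["2013 BMW 335i", "condition: good", "odometer: 137000", "paint color: black"]
attrs = parse_attributes(sample)

# Missing keys simply return None instead of raising an error.
print(attrs.get("odometer"), attrs.get("type"))
```

Collecting everything into a dictionary first avoids the long `if`/`elif` chain, and `dict.get` covers the listings where a field is missing.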


In [4]:
#print total BMWs listed
print("Total BMWs:", bmw_count)

Total BMWs: 274


In [5]:
# put all data into a dataframe using our dictionary
bmw_df = pd.DataFrame.from_dict(bmw_dict, orient = 'index', columns = ['Title',
                                                                        'Price',
                                                                        'Location', 
                                                                        'Condition', 
                                                                        'Mileage',
                                                                        'Color',
                                                                        'Car Status',
                                                                        'Transmission',
                                                                        'Car Type'])


bmw_df

| | Title | Price | Location | Condition | Mileage | Color | Car Status | Transmission | Car Type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 BMW 335i | 9999 | Long Beach | good | 137000 | black | clean | automatic | sedan |
| 1 | 2009 BMW 328i Sedan 4D | 6500 | ARTS DISTRICT | excellent | 238000 | black | clean | automatic | sedan |
| 2 | 1995 BMW M3 | 12500 | Chatsworth | good | 10100 | black | clean | manual | coupe |
| 3 | bmw 535i GT 2011 white clean title engine check | 8000 | sun valley | excellent | 124000 | white | clean | automatic | sedan |
| 4 | 2006 BMW M6(IMMACULATE AND LOW MILES) | 27000 | Beverly Hills | like new | 35000 | red | clean | automatic | coupe |
| 5 | 2003 BMW 530i E39 M-Sport Sedan Low Mile’s Ori... | 5800 | Reseda | excellent | 90000 | silver | clean | automatic | sedan |
| 6 | 2013 BMW 328i luxury PKG | 7500 | Sunland | excellent | 120010 | white | clean | automatic | sedan |
| 7 | 1987 BMW E30 325i convertible | 6000 | Hawthorne | good | 177000 | white | clean | automatic | convertible |
| 8 | 2012 BMW 328 i 3 series CLEAN TITLE 87K MILES ... | 9500 | Costa Mesa | excellent | 87000 | white | clean | automatic | sedan |
| 9 | 2004 BMW 530 i | 5750 | | excellent | 84723 | grey | clean | automatic | sedan |


In [6]:
#drop any duplicate listings (rows that match in every column)
df = bmw_df.drop_duplicates()

In [7]:
#print new distinct total of bmw listings
print("Total BMWs:", len(df))

Total BMWs: 252
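
Called without arguments, `drop_duplicates()` removes only rows that are identical in every column, so a repost that differs in a single field would survive. A possible refinement, shown here on made-up rows, is to compare a chosen subset of columns after light normalisation. The column choice and the sample data are assumptions for illustration, not something the original analysis did.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped listings.
listings = pd.DataFrame({
    "Title": ["1995 BMW M3", "1995 BMW M3", "2004 BMW 530 i"],
    "Price": ["12500", "12500", "5750"],
    "Location": ["Chatsworth", "chatsworth ", "N/A"],
})

# Normalise the text before comparing so trivial differences do not hide reposts.
normalised = listings.assign(Location=listings["Location"].str.strip().str.lower())

# Mark a row as a duplicate when title, price, and location all repeat.
is_repost = normalised.duplicated(subset=["Title", "Price", "Location"])
deduped = listings[~is_repost]

print(len(listings), "->", len(deduped))
```

The original rows are kept untouched in `deduped`; the normalised copy is used only to decide which rows count as repeats.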


In [8]:
# look at the column types and non-null counts of our data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 252 entries, 0 to 273
Data columns (total 9 columns):
Title           252 non-null object
Price           252 non-null object
Location        252 non-null object
Condition       252 non-null object
Mileage         252 non-null object
Color           252 non-null object
Car Status      252 non-null object
Transmission    252 non-null object
Car Type        223 non-null object
dtypes: object(9)
memory usage: 19.7+ KB
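
The `info()` output above shows every column stored as `object`, which means `Price` and `Mileage` are still strings. Converting them to numbers is one way to make numeric summaries such as the median available before the data leaves pandas. The sketch below uses made-up rows, and the presence of thousands separators or non-numeric entries like `"call"` is an assumption rather than something observed in the scraped data.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped data (all strings).
raw = pd.DataFrame({
    "Price": ["9999", "6500", "12,500", "call"],
    "Mileage": ["137000", "238000", "10100", None],
})

clean = raw.copy()
for column in ["Price", "Mileage"]:
    # Remove thousands separators, then turn anything non-numeric into NaN.
    without_commas = clean[column].str.replace(",", "", regex=False)
    clean[column] = pd.to_numeric(without_commas, errors="coerce")

print(clean["Price"].median())
```

With `errors="coerce"`, unparseable entries become missing values instead of stopping the conversion, so they can be counted or dropped explicitly afterwards.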


In [9]:
# summary statistics of our data
df.describe()

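| | Title | Price | Location | Condition | Mileage | Color | Car Status | Transmission | Car Type |
|---|---|---|---|---|---|---|---|---|---|
| count | 252 | 252 | 252 | 252 | 252 | 252 | 252 | 252 | 223 |
| unique | 242 | 142 | 118 | 5 | 201 | 11 | 3 | 3 | 7 |
| top | 1995 BMW M3 | 6500 | | excellent | 26000 | black | clean | automatic | sedan |
| freq | 3 | 8 | 36 | 135 | 5 | 71 | 224 | 204 | 99 |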


In [10]:
# export data into an Excel file (.xlsx, since recent pandas versions no longer write the legacy .xls format)
df.to_excel('WebMining_Craigslist_BMW.xlsx')

### Now we will use Tableau to explore, analyze, visualize, and draw conclusions...