# Web application for a travel agency - Artificial data creation

#### Ewa Dobrowolska, January 2023

The aim of this project is to create a **web application for a travel agency**: a website where customers can view the available trips, buy them, contact the owner, etc. The owner can add new offers, remove offers, alter existing trips, view the list of participants of each trip, etc. The website will be populated with artificially created data on customers, tours and registered users. The data creation process is based on web scraping and `pandas` methods.

## 1. Customers data

First, the customer data was created. We need:
 * Name
 * Surname
 * Date of birth
 * Passport number
 * Passport expiration date
 * Phone number
 * Personal identification number (SSN) 
 * user_id

The data were generated randomly. The names and surnames were drawn from web-scraped lists of the most popular first names and the most popular surnames in the USA.

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
import lxml.html
import pandas as pd
import random
import time
import datetime
from string import ascii_uppercase
import numpy as np

number_of_customers = 5000
Customers_df = pd.DataFrame()

The names were scraped from these websites: [1000 most popular boy names](https://findthelostword.com/1000-most-popular-boy-names), [1000 most popular girl names](https://findthelostword.com/1000-most-popular-girl-names)

In [2]:
# Extract boy names
r = requests.get('https://findthelostword.com/1000-most-popular-boy-names')
bs = BeautifulSoup(r.text)
li_html = bs.findAll(lambda tag: tag.name=='li')

boys = []
for element in li_html:
    t = lxml.html.fromstring(str(element))
    boys.append(t.text_content())

# Extract girl names
r = requests.get('https://findthelostword.com/1000-most-popular-girl-names')
bs = BeautifulSoup(r.text)
li_html = bs.findAll(lambda tag: tag.name=='li')

girls = []
for element in li_html:
    t = lxml.html.fromstring(str(element))
    girls.append(t.text_content())

# Check if everything works as it should
names = boys+girls
print(f"Boys names is the {type(boys)} of length {len(boys)}. First five elements: {boys[0:5]}")
print(f"Girls names is the {type(girls)} of length {len(girls)}. First five elements: {girls[0:5]}")
print(f"The final list of names is of length {len(names)}.")
# It seems OK.

# First column in the data frame - names of customers
Customers_df["Name"] = [random.choice(names) for x in range(number_of_customers)]

Boys names is the <class 'list'> of length 1000. First five elements: ['Michael', 'Christopher', 'Matthew', 'David', 'James']
Girls names is the <class 'list'> of length 1000. First five elements: ['Jennifer', 'Jessica', 'Ashley', 'Sarah', 'Amanda']
The final list of names is of length 2000.
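The two scraping loops above differ only in the URL, so the parsing step could be factored into a helper. The following is a minimal sketch, assuming Beautiful Soup's built-in `html.parser` backend is acceptable; `extract_list_items` is a hypothetical name, and the inline HTML is a stand-in for a downloaded page, not the real site content.

```python
from bs4 import BeautifulSoup


def extract_list_items(html: str) -> list[str]:
    """Return the stripped text of every <li> element in an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.find_all("li")]


# Tiny inline page standing in for a downloaded name list (illustrative only).
sample_html = "<ul><li>Michael</li><li> Christopher </li><li>Matthew</li></ul>"
sample_names = extract_list_items(sample_html)
print(sample_names)
```

With a real page, the helper would be called on `requests.get(url).text`, once per URL.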


The surnames were scraped from the website: [5000 most common surnames in the USA](https://americansurnames.us/top-surnames/1)

In [3]:
# Extract surnames in a loop

surnames = []

for page_number in range(1, 21):
    r = requests.get('https://americansurnames.us/top-surnames/' + str(page_number))
    bs = BeautifulSoup(r.text)
    li_html = bs.findAll(lambda tag: tag.has_attr("href") and tag["href"].split("/")[1]=="surname")
    
    for element in li_html:
        t = lxml.html.fromstring(str(element))
        surnames.append(t.text_content())
    sleep(1)
    

# Check if everything works as it should
print(f"Surnames is the {type(surnames)} of length {len(surnames)}. First five elements: {surnames[0:5]}, last five elements: {surnames[-5:]}.")
# It seems OK.

Customers_df["Surname"] = [random.choice(surnames) for x in range(number_of_customers)]

Surnames is the <class 'list'> of length 5000. First five elements: ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones'], last five elements: ['Bill', 'Mulcahy', 'Dionne', 'Rathbun', 'Baeza'].


The dates of birth, passport numbers, passport expiration dates and phone numbers are generated randomly:

In [4]:
def create_random_date(start_date, end_date):
    delta = end_date - start_date
    random_days = random.randrange(delta.days)
    return start_date + datetime.timedelta(days=random_days)

Customers_df["Date_Of_Birth"] = [str(create_random_date(datetime.date(1935, 1, 1), datetime.date(2004, 12, 31))) for x in range(number_of_customers)]
# randrange(10**8) covers 00000000-99999999; range(99999999) would leave out the last value
Customers_df["Passport_Number"] = [random.choice(ascii_uppercase) + str(random.randrange(10**8)).zfill(8) for x in range(number_of_customers)]
Customers_df["Passport_Expiration_Date"] = [str(create_random_date(datetime.date(2024, 1, 1), datetime.date(2032, 1, 1))) for x in range(number_of_customers)]
# Likewise, randrange(10**10) covers every 10-digit phone number
Customers_df["Phone_Number"] = [str(random.randrange(10**10)).zfill(10) for x in range(number_of_customers)]
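One detail of `create_random_date` is worth noting: `random.randrange(delta.days)` never returns `delta.days` itself, so the end date is never drawn. If both endpoints should be possible, a variant along these lines could be used. This is only a sketch; `random_date_inclusive` is a hypothetical name, not part of the notebook.

```python
import datetime
import random


def random_date_inclusive(start_date, end_date):
    """Draw a date uniformly from [start_date, end_date], both ends included."""
    span_days = (end_date - start_date).days
    # randint includes both endpoints, unlike randrange(span_days)
    return start_date + datetime.timedelta(days=random.randint(0, span_days))


start = datetime.date(1935, 1, 1)
end = datetime.date(2004, 12, 31)
draws = [random_date_inclusive(start, end) for _ in range(1000)]
print(min(draws), max(draws))
```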

Customers' personal identification numbers (SSNs, Social Security Numbers) are generated randomly according to the official rules (not every nine-digit number is a valid SSN):

In [5]:
valid_beginnings = list(range(1,666)) + list(range(667, 900))
Customers_df["SSN"] = [str(random.choice(valid_beginnings)).zfill(3) + str(random.choice(range(1, 100))).zfill(2) + str(random.choice(range(1, 10000))).zfill(4) for x in range(number_of_customers)]
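The ranges in the cell above encode three constraints: the first three digits are not 000, 666 or 900-999, the middle two digits are not 00, and the last four digits are not 0000. A small checker for those same constraints might look as follows; `looks_like_valid_ssn` is a hypothetical helper, and it only tests the format, not whether a number was ever issued.

```python
import re


def looks_like_valid_ssn(ssn: str) -> bool:
    """Check the three structural constraints used when generating the SSNs."""
    match = re.fullmatch(r"(\d{3})(\d{2})(\d{4})", ssn)
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    if area == 0 or area == 666 or area >= 900:
        return False
    return group != 0 and serial != 0


print(looks_like_valid_ssn("067504545"))
print(looks_like_valid_ssn("666123456"))
```

Applied to the generated column, `Customers_df["SSN"].map(looks_like_valid_ssn).all()` would be expected to hold.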

Customers are the tours' participants. Each participant needs to be registered for a trip by a user. Therefore, we add a column `user_id` to the data frame, which links each customer to the user who registered them. In general, you need an account to buy a trip, but one user can be linked to multiple customers, e.g. friends or family members, so there is no need to create a separate account for everyone.
For simplicity, we assume that a certain number of customers (set by the variable `number_of_users`, in this case 2000) have created an account, and each of the remaining customers is randomly linked to one of these existing accounts.

In [6]:
number_of_users = 2000
user_id = list(range(1, number_of_users+1))
user_id.extend([random.choice(user_id) for x in range(number_of_customers-number_of_users)])
Customers_df['user_id'] = user_id
Customers_df

Unnamed: 0,Name,Surname,Date_Of_Birth,Passport_Number,Passport_Expiration_Date,Phone_Number,SSN,user_id
0,Lizbeth,Kirsch,1950-03-28,M54529782,2027-11-18,8223733729,067504545,1
1,Vicente,Diaz,1991-01-23,G82760672,2024-02-06,0549554272,736993859,2
2,Adam,Simpson,1995-03-21,W98804498,2026-06-23,4640729361,559088743,3
3,Ethan,March,1987-10-10,H29711240,2029-01-02,2241841447,586093225,4
4,Erin,Ho,1997-07-12,F61405294,2026-07-29,9360933140,188746335,5
...,...,...,...,...,...,...,...,...
4995,Victoria,Martines,1976-11-10,H19160967,2026-02-09,4412320738,521268111,1352
4996,Reese,Lerma,1981-03-15,P72550011,2026-02-09,8650405089,767672976,263
4997,Aric,Tisdale,1998-03-04,Z59495341,2029-12-01,0579977301,725087363,877
4998,Jaden,Mohamed,1993-09-12,Y15966654,2028-05-25,9799547449,319494223,801
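The linking scheme has two properties that are easy to verify: every account has at least one customer (its owner), and the accounts together cover all customers. The sketch below re-creates the `user_id` list on its own, without the data frame, and checks both; `customers_per_user` is a name introduced here for illustration.

```python
import random
from collections import Counter

number_of_customers = 5000
number_of_users = 2000

# The first 2000 customers own accounts 1..2000.
user_id = list(range(1, number_of_users + 1))
# Every remaining customer is attached to one of those accounts at random.
account_ids = range(1, number_of_users + 1)
user_id.extend(random.choice(account_ids) for _ in range(number_of_customers - number_of_users))

customers_per_user = Counter(user_id)
print(len(customers_per_user), sum(customers_per_user.values()))
print(max(customers_per_user.values()))  # size of the largest "family"
```

With these settings each account is linked to 2.5 customers on average (5000 / 2000).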


Save the data to a `.csv` file:

In [7]:
Customers_df.to_csv("customers_df.csv", index=False)

## 2. Users data

As mentioned, we assume that the first 2000 (=`number_of_users`) customers have created their own accounts, and the remaining customers are linked to these 2000 accounts. So our DataFrame `Users_df` will consist of 2000 (=`number_of_users`) rows and three columns:
 * **Name** - from the DataFrame `Customers_df` 
 * **E-mail** - randomly generated from the customer's name, surname, one of the most popular e-mail domains, and a few random digits.
 * **Password** - for simplicity, we will use customers' SSNs as passwords.

In [8]:
Users_df = pd.DataFrame()
Users_df["Name"] = Customers_df["Name"][0:number_of_users]

email_domains = ["@gmail.com", "@yahoo.com", "@hotmail.com", "@aol.com", "@hotmail.co.uk", "@msn.com"]
Users_df["Email"] = [Customers_df["Name"][i][0].lower() + '.' + Customers_df["Surname"][i].lower() + str(random.choice(list(range(1000)))) + random.choice(email_domains) for i in range(number_of_users)]
Users_df["Password"] = Customers_df["SSN"][0:number_of_users]
Users_df

Unnamed: 0,Name,Email,Password
0,Lizbeth,l.kirsch769@hotmail.co.uk,067504545
1,Vicente,v.diaz215@yahoo.com,736993859
2,Adam,a.simpson277@hotmail.com,559088743
3,Ethan,e.march694@msn.com,586093225
4,Erin,e.ho365@yahoo.com,188746335
...,...,...,...
1995,Valentina,v.tejeda415@gmail.com,725472067
1996,Frankie,f.suarez665@hotmail.com,294424909
1997,Tasha,t.romeo555@hotmail.co.uk,375727268
1998,Clare,c.markley355@hotmail.co.uk,172465229
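Because the numeric suffix and the domain are drawn at random, two users with the same initial and surname could in principle receive the same address. If e-mails must be unique (for example, as login names), one option is to redraw until an unused address appears. This is a sketch under that assumption; `make_unique_email` and the repeated "Vicente Diaz" entry are hypothetical and only serve to illustrate the collision case.

```python
import random

EMAIL_DOMAINS = ["@gmail.com", "@yahoo.com", "@hotmail.com", "@aol.com", "@hotmail.co.uk", "@msn.com"]


def make_unique_email(name, surname, taken):
    """Build initial.surname<digits>@domain, redrawing until the address is unused."""
    while True:
        email = f"{name[0].lower()}.{surname.lower()}{random.randrange(1000)}{random.choice(EMAIL_DOMAINS)}"
        if email not in taken:
            taken.add(email)
            return email


taken_emails = set()
# The last entry repeats a name on purpose to exercise the uniqueness check.
people = [("Lizbeth", "Kirsch"), ("Vicente", "Diaz"), ("Vicente", "Diaz")]
emails = [make_unique_email(name, surname, taken_emails) for name, surname in people]
print(emails)
```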


Save the data to a `.csv` file:

In [9]:
Users_df.to_csv("users_df.csv", index=False)

## 3. Tours data

Third, the tours data were created. The necessary features were:
 * Country of destination
 * Region of destination
 * City of departure
 * Start date
 * End date
 * Duration
 * Price
 * Children accessibility

The method of data creation was similar to the one above.

In [10]:
number_of_tours = 250
Tours_df = pd.DataFrame()

First, data on tourism by country were scraped from the [ranking of countries by international arrivals](https://www.indexmundi.com/facts/indicators/ST.INT.ARVL/rankings).

In [11]:
r = requests.get('https://www.indexmundi.com/facts/indicators/ST.INT.ARVL/rankings')
bs = BeautifulSoup(r.text)
table = bs.findAll(lambda tag: tag.name=='table')

In [12]:
ranking = pd.read_html(str(table[0]), header=0)[0]
ranking[(ranking["Country"] == 'Ukraine') | (ranking["Country"] == 'Russia')]

Unnamed: 0,Rank,Country,Value,Year
29,30,Russia,6359000.0,2020
49,50,Ukraine,3382000.0,2020


As we can see, Ukraine and Russia are listed in the ranking. Due to the war, it is not possible to travel there, so they are removed from the data frame.

In [13]:
ranking.drop(ranking[(ranking["Country"] == 'Ukraine') | (ranking["Country"] == 'Russia')].index, inplace=True)
ranking[(ranking["Country"] == 'Ukraine') | (ranking["Country"] == 'Russia')] # it worked

Unnamed: 0,Rank,Country,Value,Year


In [14]:
Tours_df["Destination"] = random.choices(list(ranking["Country"]), weights = list(ranking["Value"]), k = number_of_tours)

Tours_df.Destination.value_counts() # It looks OK

France        31
Poland        19
Thailand      16
Mexico        13
Spain         13
              ..
Kenya          1
Tunisia        1
Azerbaijan     1
Montenegro     1
Belarus        1
Name: Destination, Length: 66, dtype: int64
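The destinations are drawn with `random.choices`, which treats the `weights` as relative frequencies: a country with twice the arrivals is picked about twice as often. The toy example below illustrates this with made-up labels and weights (they are not taken from the ranking).

```python
import random
from collections import Counter

# Hypothetical countries and arrival counts, chosen so the shares are 60/30/10 %.
countries = ["A", "B", "C"]
arrivals = [6_000_000, 3_000_000, 1_000_000]

random.seed(0)  # fixed seed so the sketch is reproducible
draws = random.choices(countries, weights=arrivals, k=10_000)
shares = {country: count / len(draws) for country, count in Counter(draws).items()}
print(shares)
```

The weights do not need to sum to 1; only their ratios matter.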

Now it's time to assign each country its region.

In [15]:
ctry_name = []
region = []

for region_name in ['africa', 'asia', 'central_america_and_the_caribbean', 'europe', 'middle_east', 'north_america',
                   'oceania', 'south_america', 'southeast_asia', 'arctic_region', 'antarctic_region']:
    # Turn e.g. 'middle_east' into 'Middle East'
    nicer_name = ' '.join(part.capitalize() for part in region_name.split('_'))
    
    r = requests.get('https://www.indexmundi.com/' + region_name + '.html')
    bs = BeautifulSoup(r.text)
    li_html = bs.findAll(lambda tag: tag.name=='li')

    for element in li_html:
        t = lxml.html.fromstring(str(element))
        ctry_name.append(t.text_content())
        region.append(nicer_name)
    sleep(1)

In [16]:
ctry_reg = dict(zip(ctry_name, region))

ranking["Reg"] = ranking["Country"].map(ctry_reg)

ranking[ranking["Reg"].isna()] # we need to fill missing values

Unnamed: 0,Rank,Country,Value,Year,Reg
16,17,Slovak Republic,15299000.0,2018,
23,24,Kyrgyz Republic,8508000.0,2019,
31,32,"Macao SAR, China",5897000.0,2020,
47,48,"Hong Kong SAR, China",3569000.0,2020,
56,57,Korea,2519000.0,2020,
57,58,Syrian Arab Republic,2424000.0,2019,
71,72,The Bahamas,1794500.0,2020,
97,98,Myanmar,903000.0,2020,
101,102,Lao PDR,886400.0,2020,
110,111,Côte d'Ivoire,668000.0,2020,


In [17]:
Reg_fill = ['Europe', 'Asia', 'Southeast Asia', 'Southeast Asia', 'Asia',
            'Middle East', 'Central America And The Caribbean', 'Southeast Asia', 'Southeast Asia', 'Africa',
            'Central America And The Caribbean', 'Central America And The Caribbean', 'Africa', 'Africa', 'Central America And The Caribbean',
            'Africa', 'Africa', 'Africa', 'Europe', 'Africa']

fill = pd.DataFrame(index = ranking.index[ranking.isnull().any(axis=1)], data = Reg_fill,columns=["Reg"])
ranking = ranking.fillna(fill)

ranking[ranking["Reg"].isna()] # Success!

ctry_reg2 = dict(zip(ranking["Country"], ranking["Reg"])) # Better version of a dictionary

Tours_df["Region"] = Tours_df["Destination"].map(ctry_reg2)
Tours_df['Region'] = Tours_df["Region"].replace('Central America And The Caribbean', 'Central America') # Shorter name looks better 
Tours_df # Looks OK

Unnamed: 0,Destination,Region
0,Poland,Europe
1,Italy,Europe
2,Singapore,Southeast Asia
3,Canada,North America
4,Ireland,Europe
...,...,...
245,France,Europe
246,Czech Republic,Europe
247,France,Europe
248,France,Europe
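Filling the missing regions from a positional list works only as long as the rows stay in the same order. An alternative is an explicit dictionary keyed by country name. The sketch below shows the idea with plain dictionaries; the pairs are an assumption, read off by matching the missing-value table above with the first entries of `Reg_fill` in order, and `region_for` and `scraped_regions` are hypothetical names.

```python
# Regions for country names that the scraped region pages spell differently.
# Assumption: pairs follow the order of the missing-value table and Reg_fill.
manual_regions = {
    "Slovak Republic": "Europe",
    "Kyrgyz Republic": "Asia",
    "Korea": "Asia",
    "Syrian Arab Republic": "Middle East",
    "The Bahamas": "Central America And The Caribbean",
    "Côte d'Ivoire": "Africa",
}

# Stand-in for the scraped country -> region dictionary (ctry_reg).
scraped_regions = {"France": "Europe"}


def region_for(country, scraped, manual):
    """Prefer the scraped region; fall back to the manual table; else None."""
    return scraped.get(country, manual.get(country))


print(region_for("France", scraped_regions, manual_regions))
print(region_for("Korea", scraped_regions, manual_regions))
```

A country missing from both dictionaries yields `None`, which makes remaining gaps easy to spot.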


We pick the start and end dates randomly, such that each tour lasts between 7 and 30 days.

In [18]:
possible_durations = [7, 10, 12, 14, 16, 17, 18, 21, 24, 25, 30]
frequencies = [20, 10, 10, 20, 7, 5, 5, 7, 7, 5, 3]
Tours_df["Start_Date"] = [create_random_date(datetime.date(2023, 1, 1), datetime.date(2024, 6, 30)) for x in range(number_of_tours)]

def random_date_shift(start_date, possible_durations, frequencies):
    random_days = random.choices(possible_durations, weights = frequencies, k=1)
    return start_date + datetime.timedelta(days=random_days[0])

Tours_df["End_Date"] = Tours_df["Start_Date"].map(lambda x: random_date_shift(x, possible_durations, frequencies))
Tours_df["Duration"] = (Tours_df["End_Date"]-Tours_df["Start_Date"])
Tours_df["Duration"] = [Tours_df.Duration[i].days for i in range(number_of_tours)]
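The `frequencies` list acts as relative weights for `possible_durations`; it sums to 99 rather than 100, which is fine because `random.choices` normalises the weights. The implied probabilities and the expected tour length can be computed directly:

```python
possible_durations = [7, 10, 12, 14, 16, 17, 18, 21, 24, 25, 30]
frequencies = [20, 10, 10, 20, 7, 5, 5, 7, 7, 5, 3]

total_weight = sum(frequencies)
# Probability of each duration = its weight divided by the total weight
probabilities = [f / total_weight for f in frequencies]
# Expected duration = sum of duration * probability
expected_duration = sum(d * p for d, p in zip(possible_durations, probabilities))

print(total_weight)
print(round(expected_duration, 1))
```

So tours of 7 and 14 days are the most likely (about 20% each), and the average tour lasts a little under 15 days.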

The next feature is the city of departure:

In [19]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States')
bs = BeautifulSoup(r.text)
table = bs.findAll(lambda tag: tag.name=='table')

airports = pd.read_html(str(table[0]), header=0)[0]
airports["Major cities served"]
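# Assign the result back: calling replace(..., inplace=True) on a selected column may not update the DataFrame in newer pandas versions
airports["Major cities served"] = airports["Major cities served"].replace(to_replace = ["Dallas/Fort Worth", "Baltimore & Washington, D.C."], value = ["Dallas", "Washington, D.C."])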
possible_airports = list(set(airports["Major cities served"]))

Tours_df["City_Of_Departure"] = [random.choice(possible_airports) for x in range(number_of_tours)]

Prices are generated randomly and then adjusted for the region and duration of the trip:

In [20]:
Tours_df["Price"] = [random.choice(range(1000, 3000)) for x in range(number_of_tours)]

bonus_Europe = Tours_df["Region"].map(lambda x: 2000 if x =="Europe" else 0)
bonus_Asia = Tours_df["Region"].map(lambda x: 1000 if x =="Asia" else 0)

tripduration_costs = dict(zip(possible_durations, [0, 200, 400, 500, 600, 700, 800, 1000, 1300, 1500, 2000]))

bonus_duration = Tours_df["Duration"].map(tripduration_costs)
total_bonus = bonus_Europe + bonus_Asia + bonus_duration

Tours_df["Price"] = Tours_df["Price"] + total_bonus
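The pricing rule can be restated as a single function: a base price between 1000 and 2999, plus a regional surcharge, plus a duration surcharge. The sketch below mirrors the cell above with plain dictionaries; `tour_price` and the constant names are introduced here for illustration. Note that the regional surcharge is an exact match on the region name, so a "Southeast Asia" tour does not receive the "Asia" surcharge.

```python
DURATION_SURCHARGE = {7: 0, 10: 200, 12: 400, 14: 500, 16: 600, 17: 700,
                      18: 800, 21: 1000, 24: 1300, 25: 1500, 30: 2000}
REGION_SURCHARGE = {"Europe": 2000, "Asia": 1000}


def tour_price(base_price, region, duration):
    """Base price plus the surcharges for region and duration."""
    return base_price + REGION_SURCHARGE.get(region, 0) + DURATION_SURCHARGE[duration]


print(tour_price(1000, "Europe", 30))          # cheapest possible 30-day European tour
print(tour_price(1583, "Southeast Asia", 14))  # no regional surcharge applies
```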

Finally, the possibility to travel with children:

In [21]:
opts = ["Yes", "No"]
Tours_df["Children"] = random.choices(opts, weights = [0.3, 0.7], k=number_of_tours)

Tours_df

Unnamed: 0,Destination,Region,Start_Date,End_Date,Duration,City_Of_Departure,Price,Children
0,Poland,Europe,2023-10-29,2023-11-19,21,"Washington, D.C.",4698,Yes
1,Italy,Europe,2023-12-18,2024-01-17,30,Atlanta,5094,No
2,Singapore,Southeast Asia,2023-09-23,2023-10-07,14,Miami,2083,No
3,Canada,North America,2024-02-25,2024-03-03,7,Denver,1758,Yes
4,Ireland,Europe,2023-06-23,2023-06-30,7,Tampa,3011,No
...,...,...,...,...,...,...,...,...
245,France,Europe,2023-09-27,2023-10-11,14,Charlotte,4771,No
246,Czech Republic,Europe,2023-11-28,2023-12-12,14,Tampa,4381,Yes
247,France,Europe,2023-09-14,2023-09-26,12,Atlanta,3457,No
248,France,Europe,2023-08-15,2023-08-29,14,Newark,4043,Yes


Save the data to a `.csv` file:

In [22]:
Tours_df.to_csv("tours_df.csv", index=False)

## 4. Bookings data

Finally, the last DataFrame, `Bookings_df`, records which customer chose which trip. We assume that the maximum number of participants in each trip is 25.

In [23]:
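# Collect (tour_id, customer_id) pairs first; building the DataFrame once is faster than appending row by row
booking_rows = []
customer_ids = range(1, number_of_customers + 1)

for tour_id in range(1, number_of_tours + 1):
    number_of_participants = random.randint(0, 25)
    # sample() draws without replacement, so no customer is booked twice on the same tour
    for customer_id in random.sample(customer_ids, number_of_participants):
        booking_rows.append([tour_id, customer_id])

Bookings_df = pd.DataFrame(booking_rows, columns=["tour_id", "customer_id"])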

Bookings_df

Unnamed: 0,tour_id,customer_id
0,1,1126
1,1,266
2,1,4844
3,1,2182
4,1,2120
...,...,...
3118,249,4179
3119,249,3484
3120,250,890
3121,250,4350
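Two properties of the bookings are worth checking before loading them into the application: no customer should appear twice on the same tour, and no tour should exceed the participant limit. A minimal check over `(tour_id, customer_id)` pairs could look like this; `check_bookings` is a hypothetical helper, and the third sample pair is a deliberate duplicate added for illustration.

```python
from collections import Counter


def check_bookings(bookings, max_participants=25):
    """Return (duplicate pairs, tours over the limit) for (tour_id, customer_id) pairs."""
    bookings = list(bookings)
    pair_counts = Counter(bookings)
    duplicates = [pair for pair, count in pair_counts.items() if count > 1]
    tour_sizes = Counter(tour_id for tour_id, _ in bookings)
    overfull = [tour_id for tour_id, size in tour_sizes.items() if size > max_participants]
    return duplicates, overfull


# First two pairs as in the output above; the third repeats a pair on purpose.
sample_bookings = [(1, 1126), (1, 266), (1, 1126), (2, 890)]
duplicates, overfull = check_bookings(sample_bookings)
print(duplicates, overfull)
```

For the real data, the pairs can be obtained with `Bookings_df.itertuples(index=False, name=None)`.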


Save the results to a `.csv` file:

In [24]:
Bookings_df.to_csv("bookings_df.csv", index=False)

Now we can proceed with the web application - all the data is ready.