# Converting CSV to TSV
1. Create the Python script. Run in a terminal: touch csv2tab && chmod u+x csv2tab
2. Paste these 3 lines into csv2tab:
    - #!/usr/bin/env python
    - import csv, sys
    - csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))
3. Move csv2tab to the script folder "./dsci_550_a1"
4. Activate the Conda environment and run:
    - ./dsci_550_a1/csv2tab < ./data/raw/haunted_places.csv > ./data/raw/haunted_places.tab
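
The three-line script above reads CSV from stdin and writes tab-separated rows to stdout. As a minimal sketch, the same conversion can be wrapped in a function and tried on an in-memory sample. The function name `csv_to_tsv` and the sample row are made up for illustration and are not part of the original script.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy rows from a CSV text stream to a tab-separated text stream."""
    # Same technique as csv2tab: parse with csv.reader, re-emit with the
    # built-in 'excel-tab' dialect (tab delimiter, "\r\n" line endings).
    writer = csv.writer(dst, dialect="excel-tab")
    writer.writerows(csv.reader(src))


# Hypothetical sample: the quoted field contains a comma, which is why a
# plain text replacement of "," with "\t" would not be safe.
sample = io.StringIO('city,description\r\nAda,"misty, blue figure"\r\n', newline="")
out = io.StringIO(newline="")
csv_to_tsv(sample, out)
tsv_text = out.getvalue()
print(repr(tsv_text))
```

Going through the csv module rather than a search-and-replace keeps commas inside quoted descriptions intact, which matters for a free-text column like description.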


## Preliminary Data Exploration

In [2]:
import pandas as pd

df_raw = pd.read_csv("../data/raw/haunted_places.tab", sep = "\t") 
df_raw.head()

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.49548,42.960727
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.75303,42.243097


In [72]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   city            10989 non-null  object 
 1   country         10992 non-null  object 
 2   description     10992 non-null  object 
 3   location        10989 non-null  object 
 4   state           10992 non-null  object 
 5   state_abbrev    10992 non-null  object 
 6   longitude       9731 non-null   float64
 7   latitude        9731 non-null   float64
 8   city_longitude  10963 non-null  float64
 9   city_latitude   10963 non-null  float64
dtypes: float64(4), object(6)
memory usage: 858.9+ KB


### Rows with Missing Values
**We have 1272 rows with missing values**. 
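- Most missing values are in the longitude and latitude columns (1261 missing in each); city_longitude and city_latitude are missing in 29 rows, and city and location in 3 rows each
- Rows without coordinates often use local descriptors as the location, e.g. "Hikyes Tomb", "Hell's Bridge", "Where the old Train Station used to be"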
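
A minimal sketch of how such counts can be derived, assuming a pandas DataFrame shaped like df_raw. The toy frame `df_demo` and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for df_raw.
df_demo = pd.DataFrame({
    "city": ["Ada", "Algoma Township", None],
    "longitude": [-85.504893, None, -76.671167],
    "city_longitude": [-85.49548, -85.62293, None],
})

# Missing values per column.
missing_per_column = df_demo.isna().sum()

# Rows with at least one missing value; a row missing several columns
# is counted once.
rows_with_missing = int(df_demo.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with any missing value:", rows_with_missing)
```

Because a single row can be missing values in several columns, the per-column counts do not necessarily add up to the row count.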



In [73]:
# Rows with NaN values
df_raw[df_raw.isna().any(axis = 1)].head(20)


Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude
6,Algoma Township,United States,On a winding dirt road next to the Rogue River...,Hell's Bridge,Michigan,MI,,,-85.62293,43.149293
15,Assininns,United States,Before the building was turned into a Tribal C...,the old tribal center/orphanage,Michigan,MI,,,-88.477352,46.81021
17,Augusta,United States,Hotel now owned by Michigan State University. ...,Brook Lodge Hotel,Michigan,MI,,,-85.352222,42.336429
49,Burton,United States,ticking clocks that are not there. Things will...,Naturally You,Michigan,MI,,,-83.616342,42.999472
50,Byron,United States,There was a man named Nick Bradon that lived a...,McGuire's House,Michigan,MI,,,-83.944403,42.822809
61,Cheboygan,United States,An old farmer went insane and killed his famil...,Hikyes Tomb,Michigan,MI,,,-84.47448,45.646956
63,Cheboygan,United States,Years ago there was a terrible train accident ...,Where the old Train Station used to be,Michigan,MI,,,-84.47448,45.646956
66,Chelsea,United States,The foundation of the house is still there. Th...,Pink Palace,Michigan,MI,,,-84.020503,42.318092
67,Chesaning,United States,Store merchandise has been found moved when st...,Chesaning Market Street Square (a mini mall wi...,Michigan,MI,,,-84.114975,43.184748
78,Cockeysville,United States,Built over a cemetery where odd things always ...,Padonia Park Club,Michigan,MI,-76.671167,39.453642,,


### **Top Cities**:
- Most of the top cities have large populations
- Notable cities that rank higher than their population alone would suggest:
    - Honolulu 
    - Salem (Witches)
    - El Paso 
    - Laredo 
    - San Antonio (Alamo?)
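
One way to rank cities by number of entries is `Series.value_counts`, sketched here on a hypothetical toy frame. Unlike `groupby("city").count()`, which reports non-null counts for every column, this returns a single count of rows per city.

```python
import pandas as pd

# Toy data (hypothetical) standing in for the city column of df_raw.
df_demo = pd.DataFrame(
    {"city": ["Salem", "Honolulu", "Salem", "Laredo", "Salem", "Honolulu"]}
)

# value_counts sorts from most to fewest entries and skips missing cities.
demo_counts = df_demo["city"].value_counts()
top_two = demo_counts.head(2).index.tolist()

print(demo_counts)
print("Top 2 cities:", top_two)
```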


In [74]:
# Count entries per city once, sorted from most to fewest
city_counts = df_raw.groupby("city").count().sort_values("country", ascending=False)
Top_Cities = city_counts.index.tolist()

print("Top 20 Cities:", Top_Cities[:20])
city_counts.head(20)

Top 20 Cities: ['Los Angeles', 'San Antonio', 'Honolulu', 'Pittsburgh', 'Columbus', 'Springfield', 'Salem', 'El Paso', 'Houston', 'Laredo', 'Orlando', 'Tucson', 'Riverside', 'Chicago', 'Portland', 'Seattle', 'San Francisco', 'San Diego', 'Louisville', 'Lexington']


city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude
Los Angeles,61,61,61,61,61,60,60,61,61
San Antonio,55,55,55,55,55,55,55,55,55
Honolulu,43,43,43,43,43,43,43,43,43
Pittsburgh,42,42,42,42,42,39,39,42,42
Columbus,41,41,41,41,41,33,33,41,41
Springfield,40,40,40,40,40,36,36,40,40
Salem,40,40,40,40,40,38,38,40,40
El Paso,38,38,38,38,38,38,38,38,38
Houston,34,34,34,34,34,34,34,34,34
Laredo,32,32,32,32,32,31,31,32,32


#### **Common Locations: Schools**
Looking into Honolulu and San Antonio, the majority of these stories come from schools.
- Honolulu: {"Chaminade University", "Sacred Hearts Academy", "Kamehameha Schools"}
- San Antonio: {"Taft High School", "University of the Incarnate Word"}
- Other notable locations: {Motels, Hotels, Religious Buildings, The Alamo}
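
The share of school-related entries per city could be estimated with a keyword match on the location column. This is only a rough sketch: the keyword list in `school_pattern` is an assumption, the toy rows are hypothetical, and entries such as the "Kindred elementary" story that name the school only in the description would be missed.

```python
import pandas as pd

# Toy rows (hypothetical selection) standing in for df_raw.
df_demo = pd.DataFrame({
    "city": ["San Antonio", "San Antonio", "San Antonio", "Honolulu", "Honolulu"],
    "location": [
        "Taft High School",
        "The Alamo",
        "University of the Incarnate Word",
        "Kamehameha Schools",
        "Chaminade University",
    ],
})

# Assumed keywords for school-like locations; adjust as needed.
school_pattern = r"school|university|academy|college"

# Case-insensitive match; missing locations count as non-matches.
is_school = df_demo["location"].str.contains(school_pattern, case=False, na=False)

# Fraction of entries per city whose location looks like a school.
school_share = is_school.groupby(df_demo["city"]).mean()
print(school_share)
```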

In [75]:
df_raw.groupby("city").get_group("San Antonio")
# df_raw.groupby("city").get_group("Honolulu")

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude
8571,San Antonio,United States,Cold spots and a feeling of melancholy can be ...,The Alamo,Texas,TX,-98.486142,29.425967,-98.493628,29.424122
8572,San Antonio,United States,well the theater was built over an old cement ...,Alamo quarry theaters,Texas,TX,-98.480473,29.495291,-98.493628,29.424122
8573,San Antonio,United States,There is a ghost by the name Margarite that ha...,Alamo Street Theater,Texas,TX,-98.487801,29.419948,-98.493628,29.424122
8574,San Antonio,United States,"Before the Alamodome was built, existed a bad ...",The Alamodome,Texas,TX,-98.478814,29.416983,-98.493628,29.424122
8575,San Antonio,United States,Caribbean Apartments - Now known as the Willow...,Bexar,Texas,TX,-98.493628,29.424122,-98.493628,29.424122
8576,San Antonio,United States,"Building two, built next to one of the origina...",Bexar County Juvenile Detention Center,Texas,TX,-98.490844,29.394101,-98.493628,29.424122
8577,San Antonio,United States,Kindred elementary - it is said that a plane c...,Bexar,Texas,TX,-98.493628,29.424122,-98.493628,29.424122
8578,San Antonio,United States,Spanish Main Apartments - Rittiman Rd. - If yo...,Bexar,Texas,TX,-98.493628,29.424122,-98.493628,29.424122
8579,San Antonio,United States,The train tracks - In the 20's a bus full of k...,Bexar,Texas,TX,-98.493628,29.424122,-98.493628,29.424122
8580,San Antonio,United States,Tro Bridge - you'll see a long bridge that is ...,Bexar,Texas,TX,-98.493628,29.424122,-98.493628,29.424122


### **Top States**:
California has the most entries by a long shot: 1,070, compared with 696 for Texas.


In [76]:
# Count entries per state once, sorted from most to fewest
state_counts = df_raw.groupby("state").count().sort_values("country", ascending=False)
Top_States = state_counts.index.tolist()
print("Top 20 States:", Top_States[:20])
state_counts.head(20)

Top 20 States: ['California', 'Texas', 'Pennsylvania', 'Michigan', 'Ohio', 'New York', 'Illinois', 'Kentucky', 'Indiana', 'Massachusetts', 'Florida', 'Missouri', 'Georgia', 'Wisconsin', 'Alabama', 'Tennessee', 'Washington', 'Oklahoma', 'North Carolina', 'New Jersey']


state,city,country,description,location,state_abbrev,longitude,latitude,city_longitude,city_latitude
California,1070,1070,1070,1068,1070,1003,1003,1069,1069
Texas,696,696,696,696,696,637,637,696,696
Pennsylvania,649,649,649,649,649,576,576,648,648
Michigan,528,529,529,529,529,460,460,526,526
Ohio,476,477,477,477,477,422,422,475,475
New York,459,459,459,459,459,422,422,459,459
Illinois,395,395,395,395,395,344,344,394,394
Kentucky,370,370,370,370,370,310,310,370,370
Indiana,351,351,351,351,351,256,256,348,348
Massachusetts,342,342,342,342,342,311,311,342,342
