Data Source Used: NYPD Complaint Data Current (Year To Date), Historic https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-Year-To-Date-/5uac-w243
- Download the CSV for the year 2017 from the website using the export-as-CSV option with a year filter.
- Set up the HPC tunnel.
- Run the HPC gateway from the command prompt.
- Assume the downloaded file is named filename.csv.
- scp filename.csv asb862@dumbo:projectRBDA
- Log into the dumbo cluster and put the data in HDFS:
  cd projectRBDA
  hdfs dfs -put filename.csv crimeData.csv
Data Source Used: 2018 Yellow Taxi Trip Data, from NYC Open Data (https://data.cityofnewyork.us)
- Click on 'View Data', then Export CSV.
- Download the CSV from the website using the export-as-CSV option.
- Set up the HPC tunnel.
- Copy the file to dumbo using the WinSCP client.
- Connect through the Cisco VPN client, log into dumbo, and put the data in HDFS:
  hdfs dfs -ls /user/as12578
  hdfs dfs -mkdir /user/as12578/project
  hdfs dfs -put taxi.csv taxiData.csv
- Use the dataset stored in Hadoop for the cleaning and profiling code.
Data Source Used: NYC Taxi Zones https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc
- Download the CSV from the website using the export-as-CSV option.
- Set up the HPC tunnel.
- Run the HPC gateway from the command prompt.
- Assume the downloaded file is named filename.csv.
- scp filename.csv asb862@dumbo:projectRBDA
- Log into the dumbo cluster and put the data in HDFS:
  cd projectRBDA
  hdfs dfs -put filename.csv taxizone.csv
- Use the dataset stored in Hadoop for the cleaning and profiling code.
- crimeData.csv must be ingested into Hadoop as shown in the ingestion code.
- taxiData.csv must be ingested into Hadoop as shown in the ingestion code.
- taxizone.csv must be ingested into Hadoop as shown in the ingestion code.
javac -classpath `yarn classpath` -d . Cleaning.java CleaningMapper.java CleaningReducer.java
jar -cvf jar.jar *.class
hadoop jar jar.jar Cleaning crimeData.csv CleaningData
hdfs dfs -cat CleaningData/part-00000
hdfs dfs -get CleaningData/part-00000 cleanCrimeData.csv
hdfs dfs -put cleanCrimeData.csv
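The per-row logic inside CleaningMapper is not shown in this runbook. As a rough sketch of what such a cleaning pass might do, the mapper could drop the header row and any record missing a required field before writing the row back out. The column layout assumed below (a date in field 0, an offense description in field 1, with the NYPD header names CMPLNT_FR_DT and OFNS_DESC) is an illustration, not the project's actual rules:

```java
// Hypothetical row-level cleaning logic; the real CleaningMapper would
// call something like this from its map() method and emit non-null rows.
public class CleaningSketch {
    // Returns the cleaned CSV line, or null if the row should be dropped.
    static String cleanLine(String line) {
        if (line == null || line.isEmpty()) return null;
        String[] fields = line.split(",", -1);
        if (fields.length < 2) return null;                // malformed row
        if (fields[0].equals("CMPLNT_FR_DT")) return null; // header row (assumed name)
        // Assumed rule: require a complaint date and an offense description.
        if (fields[0].trim().isEmpty() || fields[1].trim().isEmpty()) return null;
        return String.join(",", fields);
    }
}
```

Keeping the rule in a plain static method like this makes it easy to unit-test outside the cluster before packaging the jar.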
javac -classpath `yarn classpath` -d . CleaningP.java CleaningPMapper.java CleaningPReducer.java
jar -cvf jar.jar *.class
hadoop jar jar.jar CleaningP taxiData.csv CleaningPData
hdfs dfs -cat CleaningPData/part-00000
hdfs dfs -get CleaningPData/part-00000 cleanTaxiData.csv
hdfs dfs -put cleanTaxiData.csv
javac -classpath `yarn classpath` -d . CleaningZ.java CleaningZMapper.java CleaningZReducer.java
jar -cvf jar.jar *.class
hadoop jar jar.jar CleaningZ taxizone.csv CleaningZData
hdfs dfs -cat CleaningZData/part-00000
hdfs dfs -get CleaningZData/part-00000 cleanTaxiZoneData.csv
hdfs dfs -put cleanTaxiZoneData.csv
Load cleanCrimeData.csv into Hadoop to profile the required columns.
javac -classpath `yarn classpath` -d . Profiling.java ProfilingMapper.java ProfilingReducer.java
jar -cvf jar.jar *.class
hadoop jar jar.jar Profiling cleanCrimeData.csv profilingOutput
hdfs dfs -cat profilingOutput/part-00000
hdfs dfs -get profilingOutput/part-00000 profileOut.txt
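The profiling job's internals are likewise not shown here. A common pattern for column profiling is for the mapper to emit (column value, 1) pairs and for the reducer to sum the counts, yielding a frequency distribution for the profiled column. The sketch below emulates that mapper-plus-reducer aggregation in plain Java; the column index and CSV layout are placeholders, not the project's actual Profiling code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the aggregation Profiling{Mapper,Reducer} could
// perform: a frequency count over one CSV column.
public class ProfilingSketch {
    // Emulates mapper (emit value, 1) + reducer (sum counts) for column `col`.
    static Map<String, Integer> profileColumn(String[] lines, int col) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            if (col >= fields.length) continue;          // skip short rows
            counts.merge(fields[col], 1, Integer::sum);  // reducer-side sum
        }
        return counts;
    }
}
```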
hdfs dfs -mkdir taxiinput
hdfs dfs -put cleanTaxiData.csv taxiinput/cleanTaxiData.csv
hdfs dfs -mkdir crimeinput
hdfs dfs -put cleanCrimeData.csv crimeinput/cleanCrimeData.csv
hdfs dfs -mkdir taxizone
hdfs dfs -put cleanTaxiZoneData.csv taxizone/cleanTaxiZoneData.csv
beeline -u jdbc:hive2://babar.es.its.nyu.edu:10000/ -n netid -p pass -f analytics.sql