regarding preprocessing dataset request #1

Open
karthikeyana opened this issue Mar 4, 2015 · 5 comments

Comments

@karthikeyana

Can you post the data preprocessing program in our blog?

@bingweiliu
Owner

Karthikeyana, what do you mean by posting the program to your blog? Where is your blog? The pre-processing program is a simple script that puts every interview on one line and removes unneeded items.

@karthikeyana
Author

import csv
import glob
import os

directory = input("INPUT Folder: ")
output = input("OUTPUT Folder: ")

txt_files = os.path.join(directory, '*.txt')

# Convert every .txt file in the input folder to a .csv file of the same name.
for txt_file in glob.glob(txt_files):
    filename = os.path.splitext(os.path.basename(txt_file))[0] + '.csv'

    # Keep both files open together so the reader is still valid while writing.
    # newline='' lets the csv module handle line endings itself.
    with open(txt_file, newline='') as input_file, \
            open(os.path.join(output, filename), 'w', newline='') as output_file:
        in_txt = csv.reader(input_file, delimiter='=')
        out_csv = csv.writer(output_file)
        out_csv.writerows(in_txt)

Sir, I am using this code to convert all the txt files to csv, but I do not get the format below. Please help me.

:POS: :41: i disagree with the reviewers who said the movie was predictable and
drawn out it was a movie with heart and you could feel the main characters plight
when he lost his companion being an animal lover i was pulling for the happy
ending of course i am disney s biggest fan and i love this movie right along with
the others p s i am a grandmother to eleven thank heavens for disney movies
:POS: :85: sit back and enjoy the interesting and exciting story of the count of
monte cristo great rainy day movie
:POS: :95: a very well done film and an excellent cast i d put it right up with the
three and four musketeers movies york reed chamberlain heston etc
:POS: :96: this is an excellent movie and i never read the book the acting and the
plot was very nice done it is one of my favorite movies
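
The sample lines above suggest one review per line, lowercased, with punctuation replaced by spaces and a `:LABEL: :ID:` prefix. A minimal sketch of that formatting, assuming those rules (the actual pre-processing script is not shown in this thread, and the name `to_training_line` is hypothetical), might look like this:

```python
import re


def to_training_line(label, doc_id, text):
    """Format one review as ':LABEL: :ID: normalized text' on a single line.

    Assumed rules, inferred from the sample lines only: lowercase the text
    and replace every run of non-alphanumeric characters with one space.
    """
    words = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return ":{}: :{}: {}".format(label, doc_id, " ".join(words))


# A made-up raw review spanning two lines, for illustration only.
raw = "A very well done film, and an excellent cast.\nI'd put it right up with the others!"
print(to_training_line("POS", 95, raw))
```

Writing these lines with plain `file.write(line + "\n")` rather than `csv.writer` avoids any quoting that the csv module may add around a field.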

@karthikeyana
Author

Sir, can you post the script in the comment box?

@karthikeyana
Author

15/03/05 02:26:39 INFO input.FileInputFormat: Total input paths to process : 2
15/03/05 02:26:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/05 02:26:39 WARN snappy.LoadSnappy: Snappy native library not loaded
15/03/05 02:26:40 INFO mapred.JobClient: Running job: job_201503042232_0030
15/03/05 02:26:41 INFO mapred.JobClient: map 0% reduce 0%
15/03/05 02:26:59 INFO mapred.JobClient: map 100% reduce 0%
15/03/05 02:27:16 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000000_0, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:16 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000001_0, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:26 INFO mapred.JobClient: map 100% reduce 6%
15/03/05 02:27:28 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000000_1, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:29 INFO mapred.JobClient: map 100% reduce 0%
15/03/05 02:27:29 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000001_1, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:37 INFO mapred.JobClient: map 100% reduce 3%
15/03/05 02:27:38 INFO mapred.JobClient: map 100% reduce 6%
15/03/05 02:27:39 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000000_2, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

15/03/05 02:27:40 INFO mapred.JobClient: map 100% reduce 3%
15/03/05 02:27:40 INFO mapred.JobClient: Task Id : attempt_201503042232_0030_r_000001_2, Status : FAILED
java.lang.NumberFormatException: For input string: "1""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.valueOf(Integer.java:582)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:26)
at com.ift.hadoop.NBTrainingReducer.reduce(NBTrainingReducer.java:8)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

@karthikeyana
Author

This is the error message I get when running the job on single-node Hadoop.
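
The exception reports the input string `1"`, that is, a count followed by a stray double quote. The thread does not show the reducer's input or its code, so the cause is unknown. One possible source, offered only as an assumption, is the quoting that Python's `csv.writer` adds around any field containing a comma. The sketch below uses a made-up record to show how that quoting can leave `1"` as the last token, and how stripping quotes before parsing avoids the error (`parse_count` is a hypothetical helper):

```python
import csv
import io

# Hypothetical record: a token and its count stored in ONE field with a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(["great,1"])

# Because the field contains the delimiter, the writer wraps it in double quotes.
written = buffer.getvalue().strip()

# Splitting the quoted line on commas leaves a quote glued to the count.
last_token = written.split(",")[-1]


def parse_count(token):
    """Strip surrounding whitespace and stray double quotes, then parse an int."""
    return int(token.strip().strip('"'))


print(repr(written), repr(last_token), parse_count(last_token))
```

Calling `int(last_token)` directly raises `ValueError` in Python, which is the analogue of the `NumberFormatException` in the log. If this is indeed the cause, writing the training file as plain text lines instead of through `csv.writer` would remove the quotes at the source.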
