Warning
If anyone who want to copy this project as his or her course project, do not contact me for further help.
This is a final project for Data Mining course.
- Detect aggressive comments in public communities, like YouTube, Twitter, etc.
- Detect fake news in public media platform.
- Dart for both ML(Machine Learning) algorithm and UI(User Interface).
Datasets (from Kaggle)
In this section, we will only give you an example for Cleaned Toxic Comments
datasets since all the dataset have the same steps to be processed.
We only need comment_text
and insult
columns since the other columns(attributes) are not very important.
Below is the code:
DataFrame dataFrame = await fromCsv(
'data/normal_offensive_data.csv',
columns: [0, 3]);
where 0
is the index of comment_text
column and 3
is the index of insult
column.
We only collect 10% of the whole dataset to train our model to speed up the time of building model.
Here is the code:
List<int> sampleCommentIndexes = [];
for (var i = 0; i < dataFrame.rows.length; i++) {
if (i % sampleStep == 0) {
sampleCommentIndexes.add(i);
}
}
dataFrame = dataFrame.sampleFromRows(sampleCommentIndexes);
where sampleStep
= 10.
After reducing the dimensions of dataset, we feed machine dataset like this:
flutter: DataFrame (159571 x 2)
comment_text insult
explanation why the edits ... 0.0
... ...
d aww he matches this ba ... 0.0
where comment_text
is the comment that user posted and insult
indicated whether this content was insult-able or not.
Then the content in every comment_text
will be sent to a Python program(written by ourselves) by HTTP GET
method and the response will be like this:
[
-0.7947729229927063,
-0.3663290739059448,
-0.6826316714286804,
0.5298082828521729,
0.38561514019966125,
...
0.8614583015441895
]
This array is the vectorized result of the content in single comment_text
.
Here, when the Python server receives the GET
request, it will convert String
text to List<Double>
vector by using BERT model in huggingface with TensorFlow framework.
After vectorization, the result will be like this:
flutter: Pre-processing result:
flutter: DataFrame (12689 x 769)
insult fea_0 fea_1 fea_2 fea_3 fea_4 ... fea_767
0.0 -0.7947729229927063 -0.3663290739059448 -0.6826316714286804 0.5298082828521729 0.38561514019966125 ... 0.8614583015441895
0.0 -0.6552587151527405 -0.5079750418663025 -0.9163077473640442 0.6903209686279297 0.7387704253196716 ... 0.6643986105918884
0.0 -0.7848144173622131 -0.4483385980129242 -0.904833972454071 0.7669529914855957 0.7172573208808899 ... 0.7397924065589905
0.0 -0.8217465281486511 -0.47220537066459656 -0.8824613690376282 0.7397663593292236 0.5861581563949585 ... 0.8556132316589355
1.0 -0.5986714363098145 -0.2154504358768463 0.4265202283859253 0.27244287729263306 -0.1582547426223755 ... 0.6849899291992188
... ... ... ... ... ... ... ...
0.0 -0.8255961537361145 -0.4590681493282318 -0.5918567776679993 0.7359922528266907 0.3755975067615509 ... 0.901710569858551
0.0 -0.731025218963623 -0.43930941820144653 -0.8263741135597229 0.638615608215332 0.3906306028366089 ... 0.7427312731742859
0.0 -0.7566186189651489 -0.4142800569534302 -0.7555342316627502 0.5424911379814148 0.40552330017089844 ... 0.7499610781669617
1.0 -0.7954379916191101 -0.47804439067840576 -0.79896080493927 0.6801205277442932 0.6424688100814819 ... 0.8404900431632996
0.0 -0.7470123171806335 -0.5245763659477234 -0.8404871225357056 0.6709993481636047 0.7077993154525757 ... 0.7762796878814697
flutter: Pre-processing done in 0:43:46.870327
We use supervised learning that makes use of class labels to predict information.
We choose KnnClassifier algorithm to build classification model.
To build model, just call:
model = await core.buildModel(dataFrame);
Predict the dataset with training data removed, then calculate the correction rate.
According to the metric above, our model has a correction rate between 65% and 70%.
Because of the limitation of BERT model, we can only handle text with limited word size. We have tested that the size of 600(the length of a String
) was acceptable.
Important
We do not upload the files below:
- normal_offensive_model.json
- normal_offensive_data.csv
- real_fake_model.json
- real_fake_data.csv
Follow steps from 3 to 5 to ensure your .csv
files are ready to be processed.
The .json
files are generated by program, those files will be generated when you finish traning your model.
- Clone doc2vec_server repository and run it in the background.
- Clone this repository.
- Create
data
folder in root directory of your cloned project. - Download datasets from here and put them in
data
folder. - Rename those two downloaded files to
normal_offensive_data.csv
andreal_fake_data.csv
respectively. - Run
flutter run
in the root directory.
All datasets come from Kaggle, which are public datasets.
The contents in dataset may contain some aggressive words.
Thanks to the following libraries/websites, we could finish our project successfully.
- sentiment_dart, it gives us an inspiration on how to achieve our goal at the very beginning.
- ml_dataframe, it provides a way to store and manipulate data.
- ml_algo, it provides lots of algorithms for ML.
- stopwordies, it provides English stop words, which can be used to identify relatively meaningless words in a sentence.
- document_analysis, it provides text vectorization method and gives us some ideas on how to vectorize text at the first.
- tutorial, it provides guideline on how to apply ML in Python and gives us a structure to follow.
It is not very efficient to send every single comment_text
to Python server and wait for the response.
Although there is no doc2vec libraries in Dart, we find there is an official document provides a guide on how to convert TensorFlow model to TensorFlow Lite model, and luckily there is a library called tflite_flutter_plus providing a interpreter between TensorFlow Lite and Dart.
So maybe we can convert the doc2vec model to TensorFlow Lite model and use it in Dart directly.
We presented it with PowerPoint, but we won't upload this file.