# Project Requirements
I am looking for several elements to be present in any good project. These are:

1. Explore the various aspects of the data, visualizing and analyzing it in different ways. It is really important that you are familiar with it. You should describe how you made various design choices, based on the dataset exploration.
2. Exploration of at least one or two techniques on which we did not spend significant time in class. For example, neural networks, support vector machines, or random forests are great ideas; but if you use them, you should explore in some depth the various options available to you for parameterizing the model, controlling complexity, etc. (This should involve more than simply varying a parameter and showing a plot of results.)
3. Other options might include feature design, or optimizing your models to deal with special aspects of the data (missing features, too many features, large numbers of zeros in the data; possible outlier data; etc.). Your report should describe what aspects you chose to focus on.
4. Performance validation. You should practice good form and use validation or cross-validation to assess your models’ performance, do model selection, combine models, etc. You should not simply try a few variations and assume you are done.
5. Adaptation to under- and over-fitting. Machine learning is not very “one size fits all”; it is impossible to know for sure what model to choose, what features to give it, or how to set the parameters until you see how it does on the data. Therefore, much of machine learning revolves around assessing performance (e.g., is my poor performance due to underfitting, or overfitting?) and deciding how to modify your techniques in response. Your report should describe how, during your process, you decided how to adapt your models and why.
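
As a rough illustration of items 4 and 5, the sketch below uses scikit-learn's `cross_validate` to compare training and validation accuracy as a complexity parameter changes. The synthetic data from `make_classification`, the decision tree learner, and the particular depths are all assumptions chosen for illustration; substitute your own features, labels, and models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): replace with your own X and y.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# Fixed, stratified folds so every setting is scored on the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for depth in [1, 3, 6, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(model, X, y, cv=cv, return_train_score=True)
    train_acc = scores["train_score"].mean()
    val_acc = scores["test_score"].mean()
    results[depth] = (train_acc, val_acc)
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  "
          f"val={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

Low training and validation accuracy together suggest underfitting; a large gap between training and validation accuracy suggests overfitting. The report should explain what you changed in response to readings like these, not only show the table.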

Your team will produce a single write-up document, approximately 6 pages long, describing the problem you chose (with dataset analysis) and the methods you used to address it, including which model(s) you tried, how you trained them, how you selected any parameters they might require, and how they performed on the test data. Consider including tables of performance of different approaches, or plots of performance used to perform model selection (i.e., over parameters that control complexity). Within your document, please try to describe to the best of your ability who was responsible for which aspects (which learners, etc.), and how the team as a whole put the ideas together.

You are free to collaborate with other teams, including sharing ideas and even code, but please document where your predictions came from. (This also relaxes the proscription against posting code or asking for code help on Ed Discussion, at least for project purposes.) For example, for any code you use, please say in your report who wrote the code and how it was applied (who determined the parameter settings and how, etc.). Collaboration is particularly useful for learning ensembles of predictors: your teams may each supply a set of predictors, and then collaborate to learn an ensemble from the set.

Some possible components of a successful project include:

- Semi-supervised methods: investigate how your knowledge of the test features can be used to improve prediction. See, for example, label propagation (http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf) or EM within, e.g., naive Bayes or a Gaussian mixture model (http://www.kamalnigam.com/papers/emcat-mlj99.pdf).
- Kernel learning, or similarity/metric learning of the dissimilarity measure used in, for example, nearest neighbors or SVMs, to improve their performance. See, for example, Weinberger and Saul 2008, http://www.cse.wustl.edu/~kilian/papers/jmlr08_lmnn.pdf.
- Neural networks and deep learning, using existing packages such as PyTorch, Keras, MXNet, and Pylearn2.
- Support vector machines. For example, you could investigate the effect of different kernel choices, regularization, etc. The libsvm implementation is pretty good.
- Go in-depth with ensembles. Use lots of learners, stacking, and information from your leaderboard performance to try to improve your prediction quality.
- Feature selection methods, such as stepwise regression (or in this case, classification); e.g., http://en.wikipedia.org/wiki/Stepwise_regression. (Note: if you use feature selection, you should use a predictor that is sufficiently complex to need feature selection!)
- New features: techniques for creating new features, including “kitchen sink” features (http://books.nips.cc/papers/files/nips21/NIPS2008_0885.pdf), clustering-based features, etc. Once you have many features, of course, you may also have to explore feature selection (see above) or regularization to control complexity.
- Sophisticated decision tree structures, e.g., http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.1587.
- etc.
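
To make the "new features" idea above concrete, here is a minimal sketch of one common random-feature construction: random Fourier features that approximate an RBF kernel. It is offered in the spirit of the "kitchen sink" approach and is not necessarily the exact construction in the linked paper. The function name `random_fourier_features`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np

def random_fourier_features(X, n_components=2000, gamma=1.0, seed=0):
    """Map X to random cosine features whose inner products approximate
    the RBF kernel exp(-gamma * ||x - y||^2). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Frequencies drawn from N(0, 2*gamma*I); phases uniform on [0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

# Toy data (made up) to check the approximation quality.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
gamma = 0.5
Z = random_fourier_features(X, n_components=5000, gamma=gamma)

# Exact RBF kernel matrix for comparison.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-gamma * sq_dists)
K_approx = Z @ Z.T
max_err = np.abs(K_exact - K_approx).max()
print(f"max abs kernel approximation error: {max_err:.4f}")
```

The resulting matrix `Z` can be passed to a linear model, which is where regularization or feature selection becomes relevant as the number of components grows.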

Ideally, you should explore how your techniques work, and how to make them work better.  This should go beyond simply tuning hyperparameters, and involve thinking about how your methods interact with the properties of the data and task.  Beware of simply throwing a lot of computation at the problem, without considering its effectiveness.  Before running computationally intensive techniques, it is a good idea to create simpler baselines that will inform you of how significantly your performance is improving.
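
One very simple baseline of the kind mentioned above is to always predict the most frequent training label. The sketch below is a minimal version; the helper name `majority_baseline_accuracy` and the toy labels are assumptions for illustration, not part of the project data.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most common training label for every test example
    and return that label with the resulting accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)

# Toy, imbalanced labels (made up): 10% positives in both splits.
train_labels = [0] * 90 + [1] * 10
test_labels = [0] * 18 + [1] * 2

label, acc = majority_baseline_accuracy(train_labels, test_labels)
print(f"always predict {label}: accuracy = {acc:.2f}")
```

On imbalanced labels this trivial predictor already scores high accuracy, so a more expensive model should be judged by how far it improves on the baseline, and possibly with a metric less sensitive to class imbalance.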

import pandas as pd

# CSV path for the Kaggle Jigsaw dataset
df = pd.read_csv("train.csv")  # change path if needed

# Show basic info
df.head(), df.shape

(                 id                                       comment_text  toxic  \
 0  0000997932d777bf  Explanation\nWhy the edits made under my usern...      0   
 1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...      0   
 2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...      0   
 3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...      0   
 4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...      0   
 
    severe_toxic  obscene  threat  insult  identity_hate  
 0             0        0       0       0              0  
 1             0        0       0       0              0  
 2             0        0       0       0              0  
 3             0        0       0       0              0  
 4             0        0       0       0              0  ,
 (159571, 8))
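
A natural next exploration step is to look at how often each label is positive and how long the comments are. The sketch below assumes the column names shown in the output above; the helper `summarize_labels` and the toy rows are hypothetical and exist only so the example runs without `train.csv`.

```python
import pandas as pd

# Label columns as they appear in the output above.
LABEL_COLS = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]

def summarize_labels(df):
    """Return per-label positive counts and rates, plus summary
    statistics of comment length in characters."""
    summary = pd.DataFrame({
        "positives": df[LABEL_COLS].sum(),
        "rate": df[LABEL_COLS].mean(),
    })
    length_stats = df["comment_text"].str.len().describe()
    return summary, length_stats

# Toy rows (made up) with the same columns as the real data.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "comment_text": ["hello", "you are awful", "thanks!", "go away"],
    "toxic": [0, 1, 0, 1],
    "severe_toxic": [0, 0, 0, 0],
    "obscene": [0, 0, 0, 0],
    "threat": [0, 0, 0, 0],
    "insult": [0, 1, 0, 0],
    "identity_hate": [0, 0, 0, 0],
})

summary, length_stats = summarize_labels(toy)
print(summary)
print(length_stats)
```

Calling `summarize_labels(df)` on the loaded training frame would give the same two summaries for the real data, which can inform choices such as the evaluation metric and how to handle rare labels.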