Data #1
Comments
How is the parsing going? Is any help needed? |
I have all the pass plays parsed but not the run plays. If a play contains "pass", it's getting parsed. |
Pushed the parsed run and pass plays. The data is all tabular. The receiver column is dirty but everything else is pretty clean; I checked each column by counting its values. The data doesn't contain any special teams plays or penalties. To aggregate the data, just use some form of group by. I also pushed the Apache Zeppelin notebook I used to do the parsing. Since you've used Python so far, I'm cool with using pandas and scikit-learn for ML from here on out. |
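For reference, a minimal pandas sketch of the value-count check and the group-by aggregation mentioned above. The rows and the player labels (`QB1`, `WR1`, ...) are made up for illustration; only the column names `passer`, `receiver`, and `yardsGained` come from this thread.

```python
import pandas as pd

# Hypothetical sample rows; the real file has more columns and many more plays.
plays = pd.DataFrame({
    "passer": ["QB1", "QB1", None, "QB2", None],
    "receiver": ["WR1", "WR2", None, "WR3", None],
    "yardsGained": [12, 0, 3, 25, -1],
})

# Sanity check: count each distinct value in a column, missing values included.
receiver_counts = plays["receiver"].value_counts(dropna=False)

# Aggregate with a group by: number of plays and total yards per passer.
# Rows with a missing passer (run plays) are dropped from the groups by default.
per_passer = plays.groupby("passer")["yardsGained"].agg(["count", "sum"])

print(receiver_counts)
print(per_passer)
```

Scanning the `value_counts` output is a quick way to spot dirty entries, such as the ones in the receiver column.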
Added the year and week to each row. Also filtered out any commas in receiver or playString to avoid issues. |
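A rough sketch of the comma cleanup in pandas, assuming "filtered out" means stripping the comma characters rather than dropping the rows (the actual parsing was done in the Zeppelin notebook, so this is only an equivalent illustration). The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows using the column names from this thread.
plays = pd.DataFrame({
    "year": [2015, 2015, 2015],
    "week": [1, 1, 2],
    "receiver": ["WR1", "WR2, Jr.", None],
    "playString": [
        "QB1 pass short left to WR1, for 5 yards",
        "QB1 pass deep right to WR2",
        "RB1 up the middle for 3 yards",
    ],
})

# Remove literal commas so the free-text columns cannot break a CSV export.
# Missing values (run plays have no receiver) pass through unchanged.
for col in ["receiver", "playString"]:
    plays[col] = plays[col].str.replace(",", "", regex=False)

print(plays)
```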
Pushed the new set. Let me know if you see any other discrepancies. Around 1000 plays were added back into the data set. |
I also noticed that there is no attribute that defines the output class, i.e. pass or run. I assume that the plays with NaN in the complete attribute are all run plays. Am I correct? |
Right, both complete and passer are redundant for a pass play. If either of them is not filled in (NaN) then it was a run play. |
So is it not required to add the output attribute to the training set? I can add a playType attribute that says whether it's a pass or run play and push the data, if that's okay. |
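Something like the following would derive playType from the rule described above (a play is a run if either complete or passer is NaN). This is only a sketch: the values stored in `complete` are placeholders, since the thread does not say how that column is encoded.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the "complete"/"incomplete" strings are assumed values.
plays = pd.DataFrame({
    "passer": ["QB1", None, "QB2", None],
    "complete": ["complete", None, "incomplete", None],
})

# A play counts as a pass only when both pass-specific columns are filled in.
is_pass = plays["complete"].notna() & plays["passer"].notna()
plays["playType"] = np.where(is_pass, "pass", "run")

print(plays)
```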
Yeah that works for me. |
It might be best to rename the file in case there are issues with the original and we have to replace it again. |
Yes. I will create a new data file with the output attribute. |
There is actually an issue with this data; I'll push up a new one shortly. |
Okay sure. |
Pushed the new data; 4 entries were invalid. |
I added a new file with the output attribute and pushed it. |
Hey, I feel that there are a few issues with the parsed data. The distance [short, deep] should be non-null for pass plays, right? There is a mismatch between the number of non-null values for complete and for distance. Is that correct? |
Sometimes there are pass plays that may not be labeled short or deep, but if you have a specific example I can check it out in the parser. |
I saw a few plays that had NaN for complete and passer, which would mean that they are run plays, but had a distance value of short. When I looked at those particular plays, one was actually a pass play that was incomplete and one was a run play with a gain of 1 yard that had the phrase "short gain" in the commentary. |
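A quick way to list plays like these is to filter for rows that look like run plays but still carry a distance label. A minimal sketch with made-up rows (the `complete` values and play strings are hypothetical):

```python
import pandas as pd

# Hypothetical rows covering the two cases described above (index 2 and 4).
plays = pd.DataFrame({
    "playString": [
        "QB1 pass short left to WR1 for 5 yards",
        "RB1 up the middle for 3 yards",
        "RB1 left end for 1 yard short gain",
        "QB2 pass deep right to WR2 incomplete",
        "QB1 pass short right incomplete",
    ],
    "passer": ["QB1", None, None, "QB2", None],
    "complete": ["complete", None, None, "incomplete", None],
    "distance": ["short", None, "short", "deep", "short"],
})

# Compare the non-null counts of the two columns to see the mismatch.
non_null_counts = plays[["complete", "distance"]].notna().sum()

# Rows labeled as runs (complete or passer missing) that still have a distance.
looks_like_run = plays["complete"].isna() | plays["passer"].isna()
suspect = plays[looks_like_run & plays["distance"].notna()]

print(non_null_counts)
print(suspect[["playString", "distance"]])
```

Each row in `suspect` still needs a manual look at its playString, since it can be either a mislabeled pass or a run whose commentary happens to contain "short".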
I'm looking into it. |
Those two should now be fixed, along with other issues like them. Let me know if you see anything else. |
There is one data point with injury information, where the play string is stored in the yardsGained attribute. I think it should be removed. You can find it by looking at the unique values of the yardsGained attribute. |
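For example, a sketch of that check in pandas, assuming yardsGained is read in as text because of the stray entry. The rows and the injury string are placeholders, not actual data from the file.

```python
import pandas as pd

# Hypothetical rows; the third one mimics a play string stored in yardsGained.
plays = pd.DataFrame({
    "playString": ["play A", "play B", "play C", "play D"],
    "yardsGained": ["5", "-2", "INJURY UPDATE: (placeholder text)", "12"],
})

# Eyeball the distinct values; anything non-numeric stands out.
print(plays["yardsGained"].unique())

# Coerce to numbers; entries that are not numeric become NaN.
numeric = pd.to_numeric(plays["yardsGained"], errors="coerce")
bad_rows = plays[numeric.isna() & plays["yardsGained"].notna()]

# Keep only the rows with a valid yardage and store it as an integer.
clean = plays.loc[numeric.notna()].copy()
clean["yardsGained"] = numeric[numeric.notna()].astype(int)

print(bad_rows)
print(clean)
```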
Pushed new data with the INJURY UPDATE filtered out. |
The data has been added to the data folder in the repo.