cartus0910/2018SUMMER_R

2018SUMMER_R

INFORMATION

WEEK 1

week1_link

PREPARE

  • Download R & RStudio
  • Read the reserved words in R

EXERCISES

  • Calculation in R
  • Variable types in R
  • Use R Markdown to create a document
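
As a small illustration of the first two exercises, the sketch below runs a few calculations and inspects common variable types with `class()`. The variable names are placeholders chosen for this example and are not taken from the course material.

```r
# Basic arithmetic
total <- 7 + 3 * 2      # multiplication binds tighter than addition
quotient <- 7 %/% 2     # integer division
remainder <- 7 %% 2     # modulo

# Common atomic types, inspected with class()
values <- list(
  number  = 3.14,
  integer = 42L,
  text    = "hello",
  flag    = TRUE
)
types <- sapply(values, class)
print(types)
```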

HW1

  • Create a repository for this class
  • Create a folder named week1
  • Add hw1.Rmd as this week's notes
  • hw_link

WEEK 2

week2_link

PREPARE

  • Install packages: ggplot2, ggmap, scales
  • Read the ggplot2 documentation

EXERCISES_MORNING

  • Use the built-in data frame: diamonds
  • Draw several plots, including bar, histogram, point, and box plots
  • exercise_part1_link
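
A minimal sketch of the four plot types, assuming the ggplot2 package is installed (the block is skipped otherwise). The choice of columns (`cut`, `price`, `carat`) is an assumption for illustration; the linked exercise may use different ones.

```r
# Assumes ggplot2 is installed; nothing is built if it is missing.
plots <- list()
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Bar chart: number of diamonds per cut
  plots$bar <- ggplot(diamonds, aes(x = cut)) + geom_bar()

  # Histogram: distribution of price
  plots$histogram <- ggplot(diamonds, aes(x = price)) +
    geom_histogram(bins = 30)

  # Point (scatter) plot: price against carat
  plots$point <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.2)

  # Box plot: price distribution within each cut
  plots$boxplot <- ggplot(diamonds, aes(x = cut, y = price)) +
    geom_boxplot()
}
```

Printing any element, for example `print(plots$boxplot)`, renders that plot.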

EXERCISES_AFTERNOON

  • Create a function to connect to a URL
  • Use the function to crawl the website
  • Organize the data by filtering out unnecessary content
  • Create a word cloud
  • exercise_part2_link
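
The steps above can be sketched in base R as follows. The helper names (`fetch_page`, `strip_tags`, `word_counts`), the stop-word list, and the sample HTML are all hypothetical, and the example uses English text for simplicity; the actual exercise may crawl a different site and tokenize differently.

```r
# Hypothetical helper: download a page and return its lines.
# Returns character(0) if the connection fails.
fetch_page <- function(page_url) {
  tryCatch(
    readLines(page_url, warn = FALSE, encoding = "UTF-8"),
    error = function(e) character(0)
  )
}

# Remove HTML tags, leaving only the text content
strip_tags <- function(lines) gsub("<[^>]+>", " ", lines)

# Count words, filtering out short tokens and stop words
word_counts <- function(text, stop_words = character(0), min_chars = 2) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nchar(words) >= min_chars & !(words %in% stop_words)]
  sort(table(words), decreasing = TRUE)
}

# Inline sample standing in for a crawled page (no network access needed)
sample_html <- c("<p>R is fun and R is free</p>", "<p>Plots in R are fun</p>")
counts <- word_counts(
  strip_tags(sample_html),
  stop_words = c("is", "and", "in", "are")
)
print(counts)
```

If the wordcloud package is installed, a frequency table like `counts` could then be drawn with `wordcloud::wordcloud(names(counts), as.vector(counts))`.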

EXTRA 1

EXTRA 2

  • Download the Debris Flow Monitoring Station data
  • Use ggplot2 to plot the stations on a map
  • extra_practice_link2

WEEK 3

week3_link

PREPARE

  • Read the documentation on EDA, TF-IDF, PCA, and k-means.
  • Review last week's homework.

EXERCISES_MORNING

EXERCISES_AFTERNOON

  • Crawl the data from the PTT movie board.
  • Group the content by hour of posting.
  • Calculate TF and IDF for the data and create a TF-IDF matrix for each hour.
  • exercise_part2_link (note that this only includes the TF-IDF process)
  • Get the introductions of best-selling books from 5 popular online bookstores (Eslite 誠品: C, Books.com.tw 博客來: B, TAAZE: T, Cite 城邦: P, Kingstone 金石堂: J).
  • Calculate TF and IDF for the data and create a TF-IDF matrix for each book introduction.
  • Create a word cloud to see the keywords across all documents.
  • Use PCA to reduce the dimensionality.
  • Plot the result to check how dispersed the documents are.
  • Determine the optimal number of clusters.
  • Use k-means to group the introductions into 5 clusters.
  • exercise_part3_link (this is the HW with the full process: TF-IDF -> PCA -> k-means)
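
The TF-IDF -> PCA -> k-means process can be sketched in base R as below. The six toy documents are invented for illustration, so the sketch uses 3 clusters instead of the 5 used in the homework. It also assumes the common definitions TF = term count / document length and IDF = log(N / document frequency); the linked notebooks may use a different weighting.

```r
set.seed(1)

# Toy corpus standing in for the crawled book introductions (invented text)
docs <- c(
  d1 = "detective murder mystery case murder",
  d2 = "murder mystery detective clue",
  d3 = "recipe kitchen cooking dinner",
  d4 = "cooking recipe dinner baking cooking recipe",
  d5 = "travel island beach journey",
  d6 = "journey travel beach hotel travel"
)
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Term frequency: share of each term within a document (documents x terms)
tf <- t(sapply(tokens, function(tk) {
  as.vector(table(factor(tk, levels = vocab))) / length(tk)
}))
colnames(tf) <- vocab

# Inverse document frequency, then the TF-IDF matrix
idf <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# PCA: keep the first two principal components
pca <- prcomp(tfidf)
scores <- pca$x[, 1:2]

# Elbow method: total within-cluster sum of squares for k = 1..5
wss <- sapply(1:5, function(k) {
  kmeans(scores, centers = k, nstart = 20)$tot.withinss
})

# Final clustering with the chosen k
km <- kmeans(scores, centers = 3, nstart = 20)
print(km$cluster)
```

Plotting `wss` against k shows where adding clusters stops paying off, and `plot(scores, col = km$cluster)` gives a quick view of how the documents spread over the first two components.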

WEEK 4

week4_link

PREPARE

  • Download the data from the Student Alcohol Consumption Survey.
  • Set the initial goal of this project: examine whether alcohol consumption has any predictive power over students' average grades.

EXERCISES

  • Use EDA to organize the data.
  • Plot the relationships between alcohol consumption and grades.
  • Combine the two grades into a single score, after checking the adjusted R-squared to confirm that no meaningful information is lost.
  • Run a t-test to see whether drinking alcohol or not influences students' academic performance.
  • Set h0 = "Whether or not a student drinks alcohol does not influence their academic performance" and h1 = "Whether or not a student drinks alcohol influences their academic performance".
  • P-value < 0.05, so h0 is rejected in favor of h1 at the 5% significance level (95% confidence level).
  • Plot how workday and weekend alcohol consumption influence average grades.
  • Conclude that the level of alcohol consumption has limited predictive power over grades.
  • Set the goals for the next stage:
    1. Examine which variable has the most predictive power over student grades.
    2. Build a model to predict the grades.
  • Use a linear model to find the variables with a significant impact on grades: studytime, schoolsup (extra educational support), and higher (wants to take higher education).
  • Use ANOVA to check that the three variables found above are the more important ones in this model.
  • Find that whether a student aims for higher education is the most important variable.
  • Build a regression tree model to see how the most important variables affect the grades.
  • Calculate the normalized mean squared error (NMSE) to assess the accuracy of these two models.
  • Plot the predictions of these two models and see that they are not very predictive.
  • Build a random forest model instead.
  • Calculate the NMSE of this model; the result is about 0.2, which indicates that this model fits better than the ones above.
  • Plot the results to see that the random forest model works well compared with the others, though it seems to underpredict the grades of lower-scoring students and overpredict the grades of higher-scoring students.
  • Plot which variables are important in this model.
  • Produce a partial dependence plot for each feature in the best-performing random forest model (500 trees), giving a graphical depiction of the marginal effect of a feature on the response.
  • Use a correlation plot to look at some features that would conventionally be thought of as important.
  • Draw the conclusion.
  • exercise_link
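
The t-test, linear model, ANOVA, and NMSE steps above can be sketched in base R on synthetic data. Everything about the data here is made up for illustration (the `drinks` column, the effect sizes, the sample size); only `studytime` mirrors a variable named above. NMSE is assumed to mean the model's mean squared error divided by the mean squared error of always predicting the mean grade, which may differ from the definition used in the linked exercise.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for the survey; effect sizes are invented
students <- data.frame(
  drinks    = rep(c(FALSE, TRUE), each = n / 2),
  studytime = sample(1:4, n, replace = TRUE)
)
students$grade <- 10 + 1.0 * students$studytime -
  1.5 * students$drinks + rnorm(n, sd = 2)

# Welch two-sample t-test: h0 = both groups have the same mean grade
tt <- t.test(grade ~ drinks, data = students)
reject_h0 <- tt$p.value < 0.05

# Linear model, then ANOVA to see how much variance each term accounts for
fit <- lm(grade ~ studytime + drinks, data = students)
anova_table <- anova(fit)

# NMSE (assumed definition): values near 0 are good,
# values near 1 are no better than predicting the mean
nmse <- function(actual, predicted) {
  mean((actual - predicted)^2) / mean((actual - mean(actual))^2)
}
model_nmse <- nmse(students$grade, predict(fit))

print(tt$p.value)
print(anova_table)
print(model_nmse)
```

The same `nmse()` helper can be applied to predictions from a regression tree or a random forest, which makes the models directly comparable. For a fair comparison the predictions should come from held-out data, not from the rows used for fitting.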
