2018SUMMER_R

INFORMATION

Name: Cartus You
ID: B06208002
Dept.: NTU Geography
Tasks each week
1. WEEK1_HW
2. WEEK2_HW1 | WEEK2_HW2 | WEEK2_EXTRA
3. WEEK3_HW1 | WEEK3_PARTIAL | WEEK3_HW2
4. WEEK4_HW

WEEK 1

week1_link

PREPARE

Download R & RStudio
Read the reserved words in R

EXERCISES

Calculation in R
Variable types in R
Use R Markdown to create a document
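
A minimal sketch of the first two exercises, using base R only. The variable names are made up for illustration and do not come from the course material.

```r
# Basic arithmetic
total     <- 7 + 3 * 2   # multiplication binds tighter than addition
quotient  <- 7 %/% 2     # integer division
remainder <- 7 %% 2      # modulo

# Common variable types, inspected with class()
x_num <- 3.14                              # numeric (double)
x_int <- 5L                                # integer (note the L suffix)
x_chr <- "geography"                       # character
x_lgl <- TRUE                              # logical
x_fct <- factor(c("low", "high", "low"))   # factor (categorical)

types <- c(class(x_num), class(x_int), class(x_chr),
           class(x_lgl), class(x_fct))
print(types)
```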

HW1

Create a repository for this class
Create a folder named week1
Add hw1.Rmd as this week's notes
hw_link

WEEK 2

week2_link

PREPARE

Install packages: ggplot2, ggmap, scales
Read the ggplot2 documentation

EXERCISES_MORNING

Use the built-in data frame: diamonds
Draw several plots, including bar, histogram, point (scatter), and boxplot
exercise_part1_link

EXERCISES_AFTERNOON

Create a function to connect to a URL
Use the function to crawl the website
Organize the data by filtering out what is unnecessary
Create a word cloud
exercise_part2_link
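
The filtering and counting step that feeds a word cloud can be sketched in base R as below. The titles and the stop-word list are made-up English stand-ins, not the crawled data. Chinese text would first need a word segmenter, which this sketch does not cover.

```r
# Hypothetical input: a few short English titles (not the crawled data)
titles <- c(
  "Landslide risk in northern Taiwan",
  "Urban heat and land use in Taipei",
  "Landslide monitoring with remote sensing"
)
stop_words <- c("in", "and", "with")   # hypothetical stop-word list

# Lower-case, split on anything that is not a letter, drop empties and stop words
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Frequency table sorted from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 5))
```

A table of this shape (one word, one count) is the usual input to a word cloud function.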

EXTRA 1

Find the most frequently used words in thesis titles from NTU Geography
Download data from the National Digital Library of Theses and Dissertations in Taiwan
Import the data as .csv
Create a word cloud and a letter cloud (NOTE: the letter cloud cannot scale the word size relative to frequency)
extra_practice_link1

EXTRA 2

Download the Debris Flow Monitoring Station data
Use ggplot2 to plot the stations on a map
extra_practice_link2

WEEK 3

week3_link

PREPARE

Read the documentation on EDA, TF-IDF, PCA, and k-means.
Review last week's homework.

EXERCISES_MORNING

Download data from two New Taipei City tables: the list of major disasters over the years (新北市歷年重大災害一覽表) and the disaster relief statistics (遭受災害救助情形).
Determine the purpose: find out whether the periods in which major disasters happened have higher emergency allowances.
Conclusion: no, there is no positive correlation between the two.
Check the data and point out the deficiencies of the original data.
exercise_part1_link
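
One way to check for a correlation like the one described above is `cor.test` from base R. The numbers below are made up purely for illustration and are NOT the New Taipei City data. The column meanings (disaster count and relief amount per period) are also an assumption.

```r
# Made-up illustrative values, NOT the real data set
disasters <- c(1, 0, 3, 2, 0, 4, 1, 2)           # hypothetical major disasters per period
relief    <- c(50, 80, 40, 90, 60, 55, 70, 45)   # hypothetical relief amount per period

# Pearson correlation with a significance test
result <- cor.test(disasters, relief, method = "pearson")

print(result$estimate)  # correlation coefficient, between -1 and 1
print(result$p.value)   # a small p-value would suggest a non-zero correlation
```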

EXERCISES_AFTERNOON

Crawl the data from the PTT movie board.
Group the content by hour of posting.
Calculate TF and IDF for the data and create a TF-IDF matrix for each hour.
exercise_part2_link (Note that this only includes the TF-IDF process)
Get the introductions of best-selling books from 5 popular online bookstores (Eslite (誠品): C, Books.com.tw (博客來): B, TAAZE: T, Cite (城邦): P, Kingstone (金石堂): J).
Calculate TF and IDF for the data and create a TF-IDF matrix for the book introductions.
Create a word cloud to see the keywords across all documents.
Use PCA to reduce the dimensionality.
Plot the result to check how dispersed the documents are.
Determine the optimal number of clusters.
Use k-means to group the introductions into 5 clusters.
exercise_part3_link (This is the HW with the TF-IDF -> PCA -> k-means process)
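
The TF-IDF -> PCA -> k-means pipeline can be sketched on toy data with base R and the stats package. The six short "documents" are invented for illustration. They stand in for the book introductions and are split by spaces rather than by a Chinese word segmenter. Two clusters are used here instead of five because the toy data only contains two topics.

```r
# Hypothetical toy corpus: three crime-themed and three cooking-themed documents
docs <- c(
  d1 = "detective murder mystery crime",
  d2 = "murder mystery detective clue",
  d3 = "detective crime clue murder",
  d4 = "recipe kitchen dessert baking",
  d5 = "kitchen baking recipe flour",
  d6 = "dessert flour baking kitchen"
)
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: rows are documents, columns are terms
tf <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))) / length(tk),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Inverse document frequency, then TF-IDF
idf   <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# PCA to reduce dimensionality, keeping the first two components
pca    <- prcomp(tfidf, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# k-means on the reduced data
set.seed(42)
km <- kmeans(scores, centers = 2, nstart = 10)
print(km$cluster)
```

On this toy corpus the two topics share no vocabulary, so the first principal component separates them and k-means recovers the two groups.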

WEEK 4

week4_link

PREPARE

Download data from the Student Alcohol Consumption Survey.
Set the initial purpose of this project: examine whether alcohol consumption has any predictive power over students' average grades.

EXERCISES

Use EDA to organize the data.
Plot the relationship between alcohol consumption and grades.
Combine the two grades into one average, after calculating the adjusted R-squared to confirm that little information is lost.
Run a t-test to see whether consuming alcohol influences academic performance.
Set H0 = "Whether or not a student consumes alcohol does not influence academic performance" and H1 = "Whether or not a student consumes alcohol influences academic performance".
P-value < 0.05, so H0 is rejected in favor of H1 at the 5% significance level.
Plot how workday and weekend alcohol consumption relate to average grades.
Conclude that the level of alcohol consumption has limited predictive power over grades.
Set the purpose for the next stage:
1. Examine which variable has the most predictive power over student grades.
2. Build a model to predict the grades.
Use a linear model to find the variables with a significant impact on grades: studytime, schoolsup (extra educational support), and higher (wants to take higher education).
Use ANOVA to check that the three variables found above are the more important ones in this model.
Find that whether a student aims for higher education is the most important variable.
Build a regression tree model to see how the most important variables affect the grades.
Calculate the normalized mean squared error (NMSE) to assess the accuracy of these two models.
Plot the predictions of these two models and see that neither is very predictive.
Build a random forest model instead.
Calculate the NMSE of this model: about 0.2, which indicates that it predicts better than the two models above.
Plot to see that the random forest model works well compared with the others, though it seems to underpredict the grades of lower-grade students and overpredict the grades of higher-grade students.
Plot which variables are important in this model.
Produce a partial dependence plot for each feature in the best-performing random forest model (500 trees), giving a graphical depiction of the marginal effect of each feature on the response.
Look at some features conventionally thought to be important, using a correlation plot.
Reach the conclusion!
exercise_link
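
The t-test, linear model, and NMSE steps above can be sketched with base R on simulated data. The data frame below is generated at random as a stand-in and is NOT the survey data; only the column name studytime matches the text, and drinks and grade are hypothetical. The NMSE shown is one common definition (model MSE divided by the MSE of always predicting the mean) and may differ from the one used in the homework.

```r
set.seed(2018)
n <- 200

# Simulated stand-in for the survey (NOT the real data)
students <- data.frame(
  drinks    = rep(c("no", "yes"), each = n / 2),   # hypothetical column
  studytime = sample(1:4, n, replace = TRUE)
)
students$grade <- 10 + 1.5 * students$studytime -
  1.0 * (students$drinks == "yes") + rnorm(n, sd = 2)

# Welch two-sample t-test: H0 says both groups have the same mean grade
tt <- t.test(grade ~ drinks, data = students)
reject_h0 <- tt$p.value < 0.05   # decision at the 5% significance level

# Linear model on the simulated predictors
fit <- lm(grade ~ studytime + drinks, data = students)

# NMSE: values near 0 mean good predictions; values near 1 mean the
# model is no better than predicting the mean grade for everyone
nmse <- function(actual, predicted) {
  mean((actual - predicted)^2) / mean((actual - mean(actual))^2)
}
model_nmse <- nmse(students$grade, predict(fit))
print(model_nmse)
```

For an in-sample linear model with an intercept, this NMSE equals 1 minus R-squared, which gives a quick sanity check on the function.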

cartus0910/2018SUMMER_R