# Linear Regression

In [1]:
using Gadfly

include( "../src/StatsWorkbook.jl" )
using .StatsWorkbook

## How It Works

Suppose that a measurable quantity $Y$ depends approximately linearly on a random variable $X$. That is, suppose there are numbers $\beta_1$ and $\beta_0$ such that:

$$Y \approx \beta_1 X + \beta_0$$

Linear regression produces a function $\hat{Y}$ that approximates this relationship by using a sample to estimate the parameters $\beta_1$ and $\beta_0$.

The estimate is produced by minimizing the _residual sum of squares (RSS)_, i.e. the sum of the squared differences between the observed values $y_i$ and the values predicted by the linear model:

$$RSS := \sum_{i=1}^{n} ( y_i - \hat{\beta_1} x_i - \hat{\beta_0})^2$$
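
As a small illustration of the quantity being minimized, the RSS for a candidate slope and intercept can be computed directly from a sample. This is only a sketch: `rss`, `xs_demo`, and `ys_demo` are hypothetical names introduced here, and the three data points are made up.

```julia
# Residual sum of squares for a candidate line y = b1 * x + b0.
# `rss` is a hypothetical helper, not part of StatsWorkbook.
rss(xs, ys, b1, b0) = sum((ys .- b1 .* xs .- b0) .^ 2)

# Tiny made-up sample lying exactly on y = 2x + 1.
xs_demo = [0.0, 1.0, 2.0]
ys_demo = [1.0, 3.0, 5.0]

rss(xs_demo, ys_demo, 2.0, 1.0)  # exact fit, so the RSS is 0.0
rss(xs_demo, ys_demo, 2.0, 0.0)  # every prediction is off by 1, so the RSS is 3.0
```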

The values $\hat{\beta_1}$ and $\hat{\beta_0}$ that minimize the RSS have a closed-form expression in terms of the $x_i$ and $y_i$ and the sample means $\bar{x}$ and $\bar{y}$:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n}{( x_i - \bar{x} )( y_i - \bar{y} )}}{\sum_{i=1}^{n}{( x_i - \bar{x} )^2}}$$

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$
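
The two formulas translate almost line for line into Julia. The sketch below assumes Julia 1.x (where `mean` lives in the `Statistics` standard library); `closed_form_coefficients` is a hypothetical name chosen for illustration and is not necessarily how the notebook's own helper is written.

```julia
using Statistics  # provides mean

# Hypothetical stand-in for a closed-form estimator; returns (slope, intercept).
function closed_form_coefficients(xs, ys)
    xbar = mean(xs)
    ybar = mean(ys)
    # Slope: sum of cross-deviations divided by the sum of squared x-deviations.
    β1 = sum((xs .- xbar) .* (ys .- ybar)) / sum((xs .- xbar) .^ 2)
    # Intercept: forces the fitted line through the point (xbar, ybar).
    β0 = ybar - β1 * xbar
    return β1, β0
end

# Made-up points lying exactly on y = 2x + 1.
closed_form_coefficients([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # (2.0, 1.0)
```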

## Assumptions

- the fitted linear relationship $\hat{Y} = \hat{\beta_1} X + \hat{\beta_0}$ is reasonably close to the real relationship between $X$ and $Y$.
- the error between the true values $Y$ and the predicted values $\hat{Y}$ is normally distributed about the regression line.
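
One informal way to examine the second assumption is to look at the residuals of a fitted line. The sketch below uses a made-up five-point sample and hypothetical variable names; it only computes the residuals and two summary statistics, and a histogram of `residuals` would be the natural next step.

```julia
using Statistics  # provides mean and std

# Made-up sample, fitted with the closed-form least-squares formulas.
xs_small = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_small = [1.1, 2.9, 5.2, 6.8, 9.0]
dx = xs_small .- mean(xs_small)
dy = ys_small .- mean(ys_small)
b1 = sum(dx .* dy) / sum(dx .^ 2)
b0 = mean(ys_small) - b1 * mean(xs_small)

# Residuals: observed values minus the values predicted by the fitted line.
residuals = ys_small .- (b1 .* xs_small .+ b0)

mean(residuals)  # approximately zero for any least-squares fit with an intercept
std(residuals)   # spread of the errors about the regression line
```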

## Example

Here's some simulated data, generated by adding a normally-distributed error term to 256 points along the line $y = 5x + 2$:

In [2]:
xs = [0.7579806186017164, 0.17092801090229615, 0.585026247028458, 0.4089603349829607, 0.2600575024453753, 0.30169616054930737, 0.3142696739223376, 0.7437711568115426, 0.14189568213108883, 0.49712212256210697, 0.8547325910796959, 0.5296991734073895, 0.4981963046447164, 0.9568099037210471, 0.5226537135244591, 0.2649166133805281, 0.20661929586226901, 0.7746533797976041, 0.4896744420369299, 0.410395582195036, 0.3943992782185628, 0.1560092191953637, 0.33350174294184143, 0.4169045516819012, 0.0874006591870422, 0.8562780289583105, 0.1537131704352539, 0.9284341014350734, 0.6569396692498166, 0.1463126721480339, 0.4575845483041643, 0.09114566327978024, 0.5654145125406547, 0.19070209511768788, 0.8768776384572923, 0.053112065720414714, 0.3608194757589045, 0.2739576149493894, 0.3447013945229991, 0.8478205731511008, 0.21028554078969397, 0.7358843907584627, 0.8765003329411982, 0.24635051417595322, 0.1645064499992317, 0.1260081610002024, 0.08448990087198394, 0.9098551224026861, 0.9829221564070423, 0.6309697832628478, 0.6603782061461387, 0.955512989605138, 0.32486195800432505, 0.7293953929496888, 0.808796780973069, 0.22215102750141735, 0.5490984436339814, 0.3123446965155692, 0.05925795146591972, 0.6774433022405515, 0.6541981946418522, 0.33448312910581635, 0.526889701672949, 0.7251495947420774, 0.49005988616492546, 0.4997597024622311, 0.7890648312369863, 0.95946514518695, 0.7505298628875916, 0.4946629032213077, 0.2209995903135713, 0.20512523926520587, 0.398287881242924, 0.0033022037210002075, 0.9507662993302248, 0.7181886652862433, 0.1987210137723976, 0.14742555776595845, 0.3016261627198402, 0.5794674726105618, 0.36742917158207034, 0.21996600006149758, 0.5797054830422734, 0.472317649219379, 0.1481561994555347, 0.6336224261073247, 0.49989632672775786, 0.7332217149067781, 0.19687368290053842, 0.2644704361986341, 0.9723071894195969, 0.8278216554008275, 0.20326336208468176, 0.060750394441438704, 0.25132892741849533, 0.3782777121320582, 0.3558982417375274, 0.3152823745441298, 
0.7642349047628929, 0.5539646368370079, 0.8993034790949153, 0.6755300278507179, 0.08416066970049885, 0.09928377286561574, 0.7722312108896274, 0.7195131205987044, 0.5914543894417188, 0.5092554619018776, 0.6324948541990398, 0.3713207504462235, 0.01661883453183055, 0.7504854473779135, 0.24445381391084808, 0.19031542986035643, 0.8862186079685483, 0.7121104178109106, 0.019892807894170916, 0.05464790805507791, 0.72452766812994, 0.1881613765837138, 0.5645058651778634, 0.041887108677669094, 0.9212229310004694, 0.28118662073753287, 0.2470071503634257, 0.8822482100872888, 0.42786042965376514, 0.14012617622557433, 0.9602916456008668, 0.32101700384572407, 0.6074019637139005, 0.4211078448038341, 0.14860961267167117, 0.08254671198373531, 0.7942470144516938, 0.16475534797905378, 0.43111321559950744, 0.9039880326378282, 0.7721152489528331, 0.844863602417151, 0.15590748317515946, 0.05731673006751081, 0.7935882403143948, 0.5135764503478302, 0.9660157282193151, 0.09838956722073222, 0.7825771062932616, 0.736750222075478, 0.3919829747064931, 0.7729745502183207, 0.21887140409020733, 0.11522575183512296, 0.07709263509674003, 0.9436568881041012, 0.9793333201676937, 0.3983212356104071, 0.49904855873440024, 0.7198810978620618, 0.1863075982157396, 0.7505131300256638, 0.41033134390252224, 0.8527642167051717, 0.7531526156520476, 0.03749660610086014, 0.5458561137772124, 0.7708272990745497, 0.91553312263421, 0.1503224223208095, 0.01131704663378108, 0.11126127450172651, 0.4133710423693364, 0.4274787911375355, 0.650549534031793, 0.2489138656679062, 0.3212101954729816, 0.3692344182310432, 0.28881158550357466, 0.3065152594627374, 0.15598253693469055, 0.034862968945976514, 0.06930743207530221, 0.9452343541103005, 0.4280227073991969, 0.3446524987490569, 0.16477846093121995, 0.9056744400015995, 0.47373328735967846, 0.5726393287129745, 0.7085660122967921, 0.6086152931794442, 0.5701770983809438, 0.735249121974402, 0.6422156480423116, 0.9039831548827744, 0.8494114123186514, 0.36721638661669553, 
0.5092182316126537, 0.8564339919959587, 0.8781391162699583, 0.468544594559805, 0.03168863462395621, 0.253495909571708, 0.3276718640346987, 0.036954907272733095, 0.9785658507564494, 0.0370086224743269, 0.36870590676008885, 0.6873500402642898, 0.23475890217412765, 0.7717496953330387, 0.16355985353654345, 0.3167686636889435, 0.8288212155576615, 0.20699929858458876, 0.5509575562398066, 0.8530798468896157, 0.4952008781093835, 0.49742372198665663, 0.787640615351616, 0.7971433435394595, 0.512647788380054, 0.47969767325605406, 0.3509555909463322, 0.26055755651346146, 0.3948416444311009, 0.4179811351634575, 0.7807161277456189, 0.5726958669153162, 0.5328788607388864, 0.7083674479672899, 0.09865652980932849, 0.3983078798519042, 0.9882637624976014, 0.7773378928676293, 0.5301083736116887, 0.32658448518506655, 0.7515300024728881, 0.5332888478133591, 0.06160259582908778, 0.8298093023518793, 0.8229480920321244, 0.2177863416773358, 0.07200235984910952, 0.4859252105354741, 0.8554321507640412, 0.33888590952767683, 0.7930964410384371, 0.3046365932855861, 0.9597722853155808, 0.21167965677239287, 0.4331575145450641, 0.27842464254599397, 0.3200031811267807, 0.19432942564572553, 0.2599819995188941, 0.24697391553584502]

256-element Array{Float64,1}:
 0.7579806186017164 
 0.17092801090229615
 0.585026247028458  
 0.4089603349829607 
 0.2600575024453753 
 0.30169616054930737
 0.3142696739223376 
 0.7437711568115426 
 0.14189568213108883
 0.49712212256210697
 0.8547325910796959 
 0.5296991734073895 
 0.4981963046447164 
 ⋮                  
 0.8554321507640412 
 0.33888590952767683
 0.7930964410384371 
 0.3046365932855861 
 0.9597722853155808 
 0.21167965677239287
 0.4331575145450641 
 0.27842464254599397
 0.3200031811267807 
 0.19432942564572553
 0.2599819995188941 
 0.24697391553584502

In [3]:
ys = [5.892925248347911, 3.4581358227750227, 5.3949147774509205, 5.473790069878774, 3.165881966022717, 4.153601890807217, 3.1908906043073144, 5.521834375787493, 1.3133906722271125, 4.097381278656962, 5.686737786793823, 5.123464229976932, 4.158202242522768, 6.749409691522013, 4.847430108392827, 3.3346856053191725, 2.9286881464293035, 5.538192584008962, 7.095010196508131, 2.3711434830797358, 2.829418142859105, 2.896110060940267, 2.2345484020005895, 4.097547123904944, -0.232170718481004, 7.579472206466516, 2.149390020539338, 6.264730656326691, 5.198150959796232, 3.3423207711191094, 5.205615251703065, 3.5330403099943952, 4.36264218914875, 3.480042631136703, 6.489377916628011, 2.787042753079087, 3.450472496639117, 2.9858654364502835, 4.008893276305431, 6.758164282831211, 2.7141793975909985, 5.626559567453735, 4.116862375281344, 3.6055142646112617, 1.0112485603367265, 2.4045707193674266, 0.7656511258315157, 7.981590097043474, 4.436336702432674, 5.4242518694823385, 4.888499900475088, 7.481583673014725, 3.0176592481180275, 5.303591980898592, 6.488869312399575, 0.8328877852611045, 4.126333128187817, 3.946528744064399, 0.8987293487377648, 7.407863474371339, 4.321676832236537, 3.6718833443818766, 4.526975350658877, 4.838899155638455, 6.134999323184028, 4.765308752655363, 4.9028052186919515, 6.596254825961779, 5.141903340103719, 3.9514923878002453, 3.1716651132571485, 2.423728894069938, 4.707914555276459, 1.1689049114050212, 7.08054293462265, 6.084020557309028, 3.7801100454239656, 4.0080317470415014, 4.238647079871859, 4.826530875671895, 4.8002028626183435, 4.00718260419227, 4.777733750984686, 2.548535455784373, 3.2368107655792, 2.922029011071165, 5.918442718534516, 5.4184960958072885, 3.6217820923073627, 2.3978981710681158, 9.419824675505428, 6.546626996433186, 3.112160139579728, 1.1942168054544717, 6.04888703532491, 3.746714585217336, 4.907318268379015, 2.1636120244777075, 6.251960947325631, 4.646769166954121, 6.828555183892667, 6.223078913587039, 4.278364914148632, 
2.6245948380928636, 5.798951530407545, 5.85670621251288, 6.304937483608985, 5.119759845191826, 5.076518999966471, 3.5211191885368045, 2.2437166802725037, 5.5617961116107955, 3.5067062100550785, 2.630205749870874, 5.120733633052186, 4.813532792952805, 1.4012387128054495, 3.009600938389024, 6.3587435863193855, 2.265529153907823, 5.282052526439736, 3.622617733895641, 5.234927975727037, 2.6915009640697973, 5.645073945920101, 4.535535510108662, 4.206722782214189, 3.496136578020632, 6.328436916621413, 1.8991793691674799, 5.190360141116741, 2.6755336558660536, 3.5438876936543267, 2.121935938719221, 5.316625837319625, 4.500382451885583, 4.669389641416667, 7.811160643668618, 6.115370656943813, 6.136900540658943, 2.269765985614792, 0.9791338090025619, 5.706673808748171, 4.693090898750483, 7.659448320516333, 2.887237178657778, 6.7665961396522825, 6.248240713058282, 3.779780366137169, 7.644390419586925, 4.762981537800137, 2.4721308166194444, 0.4393778460142126, 7.083699978774206, 7.603354437552098, 3.3550238058471256, 4.609417692182088, 4.56847629470124, 2.989161086861134, 4.901155208527865, 2.7835626191355063, 4.935697908249308, 3.283088285162584, 3.086081626132671, 4.370635681484564, 4.480631358011978, 6.7613308461657775, 2.6753388295741822, 3.0672328476825523, 1.08598785677428, 3.8336919240500404, 6.170235434490715, 5.921091048854116, 2.1330754237370204, 3.901081144906949, 5.6136744887896075, 5.1937760385088865, 2.37228721791384, 3.5592598284051995, 0.925463464851364, 1.4565207283389559, 6.137442802423493, 4.770528019925852, 2.570326929522406, 1.9214292212866464, 4.891995469337068, 4.66432627379017, 5.541640942969522, 4.129215826807988, 3.8941619939874155, 7.843919435621231, 6.646827763412995, 4.792813618706621, 6.09228936546179, 6.372419989595695, 4.899456166743802, 4.935052120443816, 6.642073147370789, 8.13190217191789, 4.538093335935618, 3.0676344738075487, 3.3064275799957317, 3.7941901157033344, 0.24997400274833748, 7.693442182810826, 3.1659170149444034, 
4.1882103422888255, 4.9250430487289405, 4.92244993152387, 5.722089887420062, 1.8633204561135488, 3.1447368179723942, 6.823159416073004, 3.9823063315909035, 4.0018366972109325, 5.781036971174917, 4.690386826613898, 3.6124471473051374, 6.001295904377244, 4.555651988904729, 6.597340792303393, 4.017710395821467, 4.1629994537039146, 1.942157339105524, 4.912514296803218, 3.9287736850427084, 7.865331309737008, 4.676413937952023, 4.9347029151358, 6.521301543358843, 2.8014519876220145, 3.3802826281503373, 8.329318877583315, 5.580704948133226, 5.09895506105163, 3.0592367807806244, 6.053022674975492, 4.337488595252576, 3.784748937246917, 6.741403443934019, 4.583812226032756, 2.7235892704459066, 1.0965567630127728, 3.523083198584504, 7.915230102655162, 1.5957064957374012, 6.15314321242325, 3.57842145145717, 8.053043652596735, 1.802576018108377, 5.038501445105806, 2.5737939341337888, 3.4378722451000017, 3.0550478651765136, 5.360406930973124, 3.954599986622715]

256-element Array{Float64,1}:
 5.892925248347911 
 3.4581358227750227
 5.3949147774509205
 5.473790069878774 
 3.165881966022717 
 4.153601890807217 
 3.1908906043073144
 5.521834375787493 
 1.3133906722271125
 4.097381278656962 
 5.686737786793823 
 5.123464229976932 
 4.158202242522768 
 ⋮                 
 7.915230102655162 
 1.5957064957374012
 6.15314321242325  
 3.57842145145717  
 8.053043652596735 
 1.802576018108377 
 5.038501445105806 
 2.5737939341337888
 3.4378722451000017
 3.0550478651765136
 5.360406930973124 
 3.954599986622715 
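
Data of this shape could be generated along the following lines. The text does not state how the $x$ values were drawn or how large the noise was, so the uniform distribution on $[0, 1)$, the unit-variance noise, the seed, and the names `xs_sim` and `ys_sim` are all assumptions made for this sketch.

```julia
using Random  # provides Random.seed!

# Assumptions (not stated in the text): x uniform on [0, 1),
# standard normal noise, and an arbitrary seed for reproducibility.
Random.seed!(1)
n = 256
xs_sim = rand(n)
# Points on the line y = 5x + 2, each shifted by a normally distributed error.
ys_sim = 5 .* xs_sim .+ 2 .+ randn(n)
```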

Since $\hat{\beta_0}$ and $\hat{\beta_1}$ have closed-form solutions, it's straightforward to calculate them from the data:

In [5]:
β1, β0 = leastsquarescoefficients( xs, ys )

(5.119555340059018, 1.953874265843682)

The estimated slope ($\approx 5.12$) and intercept ($\approx 1.95$) are close to the true values of $5$ and $2$, as expected for a representative sample with normally distributed error.
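
As a cross-check, Julia's backslash operator solves the same least-squares problem when given a design matrix with a column of ones for the intercept. The sketch below uses made-up points and hypothetical variable names rather than the notebook's data.

```julia
using LinearAlgebra  # matrix methods for the \ operator

# Made-up points lying exactly on y = 2x + 1.
xs_check = [0.0, 1.0, 2.0, 3.0]
ys_check = [1.0, 3.0, 5.0, 7.0]

# Design matrix: one column for the slope, one column of ones for the intercept.
A = [xs_check ones(length(xs_check))]

# For a tall matrix, A \ y returns the least-squares solution.
slope, intercept = A \ ys_check
```

Up to floating-point rounding, `slope` and `intercept` should agree with the values given by the closed-form formulas for the same data.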