Reddit Scraping #33

Merged
merged 21 commits into from
Apr 7, 2023

Conversation

andygello555
Owner

No description provided.

- Refactored all the Binding + Paginator + API types and interfaces into the api package. This works similarly to the Binding interface that existed within the monday package, except that the request produced by Binding.Request is an interface (api.Request), and the client taken by Binding.Execute is also an interface (api.Client) (07/03/2023 - 16:28:13)
- This allows us to create entire schemas of bindings that we can then add to an instance of the API type, which acts as a wrapper for a set of Bindings (07/03/2023 - 16:29:14)
- Removed the definitions of types that are now defined in api from the monday package (07/03/2023 - 16:29:44)
- Updated all the bindings in the monday and models packages to use the new function signatures (07/03/2023 - 16:30:07)
- Updated the monday subcommand in the CLI and the Measure phase to use the new paginator type (07/03/2023 - 16:30:38)
- Upgrade gotils to v2.1.2 (08/03/2023 - 12:59:02)
- Refactored all the NewBinding logic into the api package. This means that Bindings for the entire project can be created through the api.NewBinding method. Due to this, I have removed the now unnecessary code from monday/bindings.go as well as changed the Bindings in the monday and models packages to use this new API (08/03/2023 - 15:17:54)
- Added the reddit/types.go file to hold all the return and response types for the Reddit API bindings (08/03/2023 - 16:23:19)
- The Paginator instance has now been replaced by the Paginator interface and two new Paginator implementations: typedPaginator and paginator. NewTypedPaginator returns a Paginator that is type aware. This can only be used by Bindings that are set to their own global variables. NewPaginator returns a Paginator that is not type aware; instead, it returns a Paginator[[]any, []any] (08/03/2023 - 18:27:25)
- Added a refresh to reddit.Client.Run (08/03/2023 - 18:50:40)
- Made the Binding interface chainable and started the argument type checking with the new BindingParam type and Params() interface method #24 (09/03/2023 - 11:52:58)
- Added type checks to bindingProto.Execute #24 (09/03/2023 - 13:39:05)
- Added a way of defining BindingParams for interface values that will also be type checked by bindingProto.TypeCheckArgs #24 (10/03/2023 - 16:28:50)
- Added the api.Params function that can construct a list of BindingParams from a list of argument groupings (10/03/2023 - 16:43:24)
- Updated all the Bindings created via derivatives of the NewBinding method to set the Params method of their respective Bindings (10/03/2023 - 16:44:37)
- Added an example for the Params function (10/03/2023 - 16:44:56)
- Added a test for the Params function (10/03/2023 - 16:45:22)
- Added a test for the bindingProto.TypeCheckArgs (10/03/2023 - 16:45:39)
- Added some more fields to the BindingProto struct to aid with interface checking (10/03/2023 - 16:46:11)
- I may have gone a bit overboard with the api package... (10/03/2023 - 16:48:39)
- Moved everything to do with paginators, bindings, and params in api to their own files to tidy things up a bit (10/03/2023 - 16:56:31)
- Updated top binding for the reddit API (10/03/2023 - 18:50:55)
- Added a lot more types to the reddit API types.go file. The listingWrapper probably needs more fleshing out #2 (10/03/2023 - 18:51:38)
- Added the Binding.ArgsFromStrings interface method and implementation (10/03/2023 - 18:53:12)
- Added the reddit.RateLimit type for tracking rate limits in the reddit API, as well as the RateLimitsConfig interface for configuring rates (13/03/2023 - 15:16:26)
- A lot more reddit.Types taken mostly from https://github.com/vartanbeno/go-reddit/reddit/things.go (13/03/2023 - 15:23:14)
- Paginators are now aware of multiple different types of pagination parameters thanks to the Binding.Params method (14/03/2023 - 14:14:10)
- The untyped paginator is now the type Paginator[any, any] which makes more sense and allows us to actually use the Afterable interface (14/03/2023 - 16:22:05)
- Paginators now handle paginator params that are out of order (15/03/2023 - 14:37:32)
- Fully implemented the top, comments, and user_where bindings in the reddit.API (16/03/2023 - 15:07:18)
- Added the Paginator.Until method that can be supplied a predicate function (20/03/2023 - 14:36:47)
- Added the start of the RedditDiscoveryPhase (20/03/2023 - 14:37:04)
- Added the SubredditFetch function and a test for it in main_test (20/03/2023 - 14:37:17)
- Added a few constants to ScrapeConstants for the new Reddit discovery coroutine (20/03/2023 - 14:37:36)
- Added the DeveloperType enum type (21/03/2023 - 15:56:35)
- Added the Developer.Type field of type DeveloperType (21/03/2023 - 15:56:55)
- Added the Developer.RedditPublicMetrics field of type RedditUserMetrics which stores karma for RedditDeveloperTypes (21/03/2023 - 15:57:31)
- Changed DB types of Game.Developers and Game.VerifiedDeveloperUsernames to varchar(20)[]. 20 chars now instead of 18 to suit Reddit usernames (21/03/2023 - 15:58:27)
- Added the DeveloperSnapshot.RedditPostIDs pq.StringArray field to store the names of subreddits and the ID of a post as pairs of strings (21/03/2023 - 16:07:15)
- Added the samples/sampleRedditPostsCommentsAndUsers.json file which might come in handy for testing (21/03/2023 - 19:03:26)
- Registered the DeveloperType enum in db.init() (21/03/2023 - 19:03:42)
- Updated DiscoveryBatch to take a PostIterable interface instead of a dictionary of tweets so that I can pass an array of PostCommentsAndUser instances to it (21/03/2023 - 19:04:49)
- General refactoring of processes related to the DiscoveryBatch procedure (21/03/2023 - 19:05:44)
- Modified Developers and VerifiedDeveloperUsernames fields in Game to take a list of DeveloperType prefixed Usernames so we can keep both Reddit and Twitter usernames in the same arrays (23/03/2023 - 10:06:47)
- The above change meant that I had to modify some other bits of code, most notably: Developer.Games, Game scrape procedures, DeveloperSnapshot.calculateGameField, MeasurePhase, DeletePhase (23/03/2023 - 10:08:33)
- Added the RedditUserPostTimes and the RedditDeveloperSnapshots cached fields (23/03/2023 - 13:19:06)
- Added a goroutine for RedditDiscoveryPhase into the DiscoveryPhase procedure (23/03/2023 - 14:16:48)
- Paginators now try multiple times to get the latest rate limit (23/03/2023 - 15:43:40)
- Added write mutexes to the bindingProto for the attrs map and the attrFuncs slice (23/03/2023 - 16:35:06)
- paginatorCheckRateLimit will now only return an error if the rate limit cannot be found rather than also if the latest rate limit is before the current time (23/03/2023 - 17:27:33)
- Added the Log method to the RateLimitedClient interface so I could see what is going on with rate limit errors being thrown (23/03/2023 - 17:29:48)
- Added the PostsConsumed field to DiscoveryUpdateSnapshotStats (24/03/2023 - 11:03:21)
- replaced all occurrences of time.Now() with time.Now().UTC() because heck timezones (24/03/2023 - 11:58:04)
- Paginator.Until's predicate function now takes the currently collected pages (27/03/2023 - 11:33:27)
- Updated UpdateDeveloper to deal with Reddit Developers. It does this by making paginated requests to the user_overview binding until it can find 11 (or fewer) total posts and comments that exist after the start time (27/03/2023 - 13:26:06)
- Waiting for Scout procedure running on VM to finish before updating the TestUpdatePhase test (27/03/2023 - 13:26:52)
- Turned CachedFieldIterator into an interface then made the old implementation into an implementation of this interface. I then created the mergedCachedFieldIterators implementation that can iterate over many of the CachedFieldIterators (27/03/2023 - 13:51:33)
- Added the CachedField.Type method which returns the CachedFieldType of a CachedField which is useful for the MergedCachedFieldIterator (27/03/2023 - 14:02:59)
- Added the CachedFieldIterator.Field method which returns the cached field that the iterator is for (27/03/2023 - 14:03:32)
- Modified the SnapshotsPhase to handle Reddit developers (27/03/2023 - 14:50:15)
- Added the sampleRedditIDsUsernames.csv sample as well as updating sampleUserIDs.txt to be 50/50 Twitter and Reddit users (27/03/2023 - 15:27:57)
- Think SnapshotPhase and TestSnapshotPhase have now been updated (28/03/2023 - 16:34:55)
- Updated gotils dependency to newest version with shiny new slices methods (28/03/2023 - 16:35:32)
- Added the Error Reddit Type for returning more descriptive errors from Client.Run (29/03/2023 - 18:02:27)
- Updated the TestDisablePhase test to check the disabled developers for both Twitter and Reddit developers (30/03/2023 - 12:48:58)
- Updated the DisablePhase to iterate over each models.DeveloperType and disable the required amount for each (30/03/2023 - 12:49:23)
- Updated the State.UpdatedDevelopers/DisabledDevelopers/EnabledDevelopers fields to be an array of models.DeveloperMinimal which contains a models.DeveloperType field that can be used to filter these arrays (30/03/2023 - 14:35:50)
- State.DevelopersToEnable is now also a map of models.DeveloperType to integer counts after the refactor of the EnablePhase (30/03/2023 - 14:36:37)
- Refactored EnablePhase as well as the test for it (30/03/2023 - 14:37:39)
- Changed createFakeDevelopersWithSnaps to be zero-indexed rather than one-indexed because it was getting really annoying (30/03/2023 - 14:56:15)
- Starting the new TestMeasurePhase test (03/04/2023 - 16:08:58)
- Added the MondayConfig.TestMapping field, which has the same type as Mapping and holds the mapping that will be used in testing (03/04/2023 - 16:16:23)
- createFakeDevelopersWithSnaps now also creates Itch.IO games (03/04/2023 - 16:25:22)
- Updated the whereZeroVerified query in the DeletePhase (04/04/2023 - 11:27:59)
- Added the models.Game.Advocates method that returns the models.Developer instances that have tweeted/posted about a given Game (04/04/2023 - 11:29:03)
- Added the postIDPairs methods to MeasureContext.Funcs that will construct an array of structs containing the subreddit and post pairs within the DeveloperSnapshot.RedditPostIDs array (04/04/2023 - 11:30:15)
- Updated the measure email template to display the correct post IDs depending on the Developer's type + username is prefixed by either @ or u/ depending on the Developer's type (04/04/2023 - 11:31:44)
- Moved pretty much all of the graphql library to the monday package because errors returned by the Monday API do not conform to the GraphQL standard (04/04/2023 - 13:06:45)
- Added binding names to all Monday bindings in bindings.go as well as all bindings in game.go and steam_app.go (04/04/2023 - 13:08:10)
- Added the monday.Error type, which implements error and into which the JSON errors returned by the Monday API can be unmarshalled (04/04/2023 - 13:09:48)
- createFakeDevelopersWithSnaps now creates snaps with CreatedAt set to be a linear range (04/04/2023 - 14:20:05)
- createFakeDevelopersWithSnaps now creates a unique ID for Steam Games from the current developer index and the current game index which can be reversed (04/04/2023 - 14:20:52)
- Added the cat function to MeasureContext.Funcs (04/04/2023 - 15:16:58)
- Fixed some issues with the measure email template not being descriptive enough about the types of Developers (04/04/2023 - 15:42:44)
- Added the models.DeveloperType.EnumValue which returns the character representing a DeveloperType (04/04/2023 - 15:44:51)
- Added the DeleteItem Monday API client binding (04/04/2023 - 16:12:53)
- Added the monday.Me binding, for use in testing (05/04/2023 - 12:33:35)
- Added the additionalColumnValues to the models.AddGameToMonday binding (05/04/2023 - 13:01:51)
- Added the MustTypePaginate and MustPaginate functions to paginator.go that let me create paginators without so much error checking (05/04/2023 - 13:21:48)
- Finally implemented the TestMeasurePhase test (05/04/2023 - 13:43:36)
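
As a rough illustration of the Paginator.Until entries above (a predicate that receives the pages collected so far), here is a minimal Go sketch of how such a loop could work. It is not the code from this PR: `until`, `fakeFetch`, the page size, and the reading of the predicate as "keep fetching while it returns true" are all assumptions made for the example.

```go
package main

import "fmt"

// until is a hypothetical helper, not the api package's Paginator.Until.
// It keeps calling fetch for successive pages while keepGoing, given
// everything collected so far, returns true and more pages remain.
func until[T any](
	fetch func(page int) (items []T, more bool, err error),
	keepGoing func(collected []T) bool,
) ([]T, error) {
	var collected []T
	for page := 0; keepGoing(collected); page++ {
		items, more, err := fetch(page)
		if err != nil {
			return collected, fmt.Errorf("fetching page %d: %w", page, err)
		}
		collected = append(collected, items...)
		if !more {
			break // the source has no further pages
		}
	}
	return collected, nil
}

// fakeFetch stands in for executing a binding against a real API.
// It serves 5 pages of 4 integers each (both numbers are made up).
func fakeFetch(page int) ([]int, bool, error) {
	const pageSize, totalPages = 4, 5
	if page >= totalPages {
		return nil, false, nil
	}
	items := make([]int, pageSize)
	for i := range items {
		items[i] = page*pageSize + i
	}
	return items, page < totalPages-1, nil
}

func main() {
	// Stop once at least 11 items have been collected.
	got, err := until(fakeFetch, func(collected []int) bool {
		return len(collected) < 11
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("collected %d items\n", len(got)) // prints "collected 12 items"
}
```

Because the predicate is only checked between pages, the loop can overshoot the target (12 items rather than 11 here), so a caller wanting an exact count would trim the result afterwards.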
@andygello555 andygello555 self-assigned this Apr 7, 2023
@andygello555 andygello555 linked an issue Apr 7, 2023 that may be closed by this pull request
37 tasks
@andygello555 andygello555 merged commit 5e95c44 into main Apr 7, 2023
@andygello555 andygello555 deleted the 2-reddit-scraping branch April 7, 2023 21:52
Development

Successfully merging this pull request may close these issues.

Scraping indie game subreddits for developers and games
1 participant