TwitGet Bad Request 400 #100

Open
Copy-link opened this issue Aug 11, 2020 · 6 comments

@Copy-link
Contributor

I'm getting this problem using the old method, the one that doesn't involve a headless browser. It started just a few hours ago. At first I thought it was IP-based, like I'd hit some sort of request limit, but it didn't go away when I switched to a VPN. It also seems to be erroring out on the Twitter profile page itself, before it even gets to the step for the search API JSON.

This strikes me as very odd and makes me wonder if the error that is being thrown by xA-Scraper is even accurate.

Main.TwitGet.StatusMgr - INFO - GetArtist - veyopixel (ID: 437)
Main.WebRequest - INFO - Fetching content at URL: https://twitter.com/veyopixel
Main.WebRequest - INFO - Have additional GET parameters!
Main.WebRequest - INFO -        Item: 'Accept' -> 'application/json, text/javascript, */*; q=0.01'
Main.WebRequest - INFO -        Item: 'Referer' -> 'https://twitter.com/veyopixel'
Main.WebRequest - INFO -        Item: 'User-Agent' -> 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
Main.WebRequest - INFO -        Item: 'X-Twitter-Active-User' -> 'yes'
Main.WebRequest - INFO -        Item: 'X-Requested-With' -> 'XMLHttpRequest'
Main.WebRequest - INFO -        Item: 'Accept-Language' -> 'en-US'
Main.WebRequest - WARNING - Error opening page: https://twitter.com/veyopixel at Tue Aug 11 18:30:39 2020 On Attempt 1.
Main.WebRequest - WARNING - Error Code: HTTP Error 400: Bad Request
Main.WebRequest - WARNING - Original URL: https://twitter.com/veyopixel
Main.WebRequest - INFO - Have additional GET parameters!
Main.WebRequest - INFO -        Item: 'Accept' -> 'application/json, text/javascript, */*; q=0.01'
Main.WebRequest - INFO -        Item: 'Referer' -> 'https://twitter.com/veyopixel'
Main.WebRequest - INFO -        Item: 'User-Agent' -> 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
Main.WebRequest - INFO -        Item: 'X-Twitter-Active-User' -> 'yes'
Main.WebRequest - INFO -        Item: 'X-Requested-With' -> 'XMLHttpRequest'
Main.WebRequest - INFO -        Item: 'Accept-Language' -> 'en-US'
Main.WebRequest - ERROR - Failed to retrieve Website : https://twitter.com/veyopixel at Tue Aug 11 18:30:53 2020 All Attempts Exhausted
Main.WebRequest - CRITICAL - Critical Failure to retrieve page! https://twitter.com/veyopixel at Tue Aug 11 18:30:53 2020, attempt 2
Main.WebRequest - CRITICAL - Error:
Main.WebRequest - CRITICAL - Exiting
Main.TwitGet.StatusMgr - ERROR - Traceback (most recent call last):
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\twitScrape.py", line 256, in go
Main.TwitGet.StatusMgr - ERROR -     errored |= self.getArtist(aid=aid, artist=name, ctrlNamespace=ctrlNamespace)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\twitScrape.py", line 206, in getArtist
Main.TwitGet.StatusMgr - ERROR -     for tweet in intf.get_all_tweets(artist, min_date):
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 281, in get_all_tweets
Main.TwitGet.StatusMgr - ERROR -     interval_start = self.get_joined_date(username, twit_headers)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 149, in get_joined_date
Main.TwitGet.StatusMgr - ERROR -     ctnt = self.stateful_get("https://twitter.com/{user}".format(user=user), headers=twit_headers)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 22, in stateful_get
Main.TwitGet.StatusMgr - ERROR -     return self.__stateful_get("getpage", url, headers, params)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 54, in __stateful_get
Main.TwitGet.StatusMgr - ERROR -     page = func(url, addlHeaders=headers)
Main.TwitGet.StatusMgr - ERROR -   File "C:\Python38\lib\site-packages\WebRequest\WebRequestClass.py", line 195, in getpage
Main.TwitGet.StatusMgr - ERROR -     return self._unwaf_func("_getpage", requestedUrl, *args, **kwargs)
Main.TwitGet.StatusMgr - ERROR -   File "C:\Python38\lib\site-packages\WebRequest\WebRequestClass.py", line 160, in _unwaf_func
Main.TwitGet.StatusMgr - ERROR -     return target_func(requestedUrl, *args, **kwargs)
Main.TwitGet.StatusMgr - ERROR -   File "C:\Python38\lib\site-packages\WebRequest\WebRequestClass.py", line 658, in _getpage
Main.TwitGet.StatusMgr - ERROR -     raise Exceptions.FetchFailureError("Failed to retreive page", requestedUrl,
Main.TwitGet.StatusMgr - ERROR - WebRequest.Exceptions.FetchFailureError: <FetchFailureError 400 -> 'Bad Request' for url: https://twitter.com/veyopixel ({b''})>
Main.TwitGet.StatusMgr - ERROR -
@Copy-link
Contributor Author

Copy-link commented Aug 11, 2020

Shit. I just realized what the problem is... or at least I think this is what's wrong. They fucked with the joined date, again.


The relevant code, which is now broken...

	def get_joined_date(self, user, twit_headers):

		ctnt = self.stateful_get("https://twitter.com/{user}".format(user=user), headers=twit_headers)
		html = HTML(html=ctnt)
		joined_items = html.find(".ProfileHeaderCard-joinDateText")
		if not joined_items:
			raise exceptions.AccountDisabledException("Could not retreive artist joined date. "
				"This usually means the account has been disabled!")

		assert len(joined_items) == 1, "Too many joined items?"
		joined = joined_items[0]

		posttime = dateparser.parse(joined.attrs['title'])

		self.log.info("User %s joined twitter at %s", user, posttime)

		return posttime

.ProfileHeaderCard-joinDateText no longer exists. Now one would have to look up the text within div[data-testid="UserProfileHeader_Items"] > span, but I'm not entirely sure how to look up attributes other than class and id with this Python library.
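
On the attribute lookup: the find() call in the code above takes a CSS selector string, so an attribute selector such as div[data-testid="UserProfileHeader_Items"] > span should in principle be usable there. That is an assumption, not something verified against this code base. As a dependency-free illustration of the same lookup, here is a minimal sketch using only the standard library. The SpanFinder class, the span_texts helper, and the sample markup are all hypothetical. The real profile page is rendered by JavaScript, so this markup may not appear in a plain HTTP response at all.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; skipped so the depth tracking stays balanced.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SpanFinder(HTMLParser):
    """Collect the text of <span> elements that are direct children of the
    element whose data-testid attribute equals `testid`."""

    def __init__(self, testid):
        super().__init__()
        self.testid = testid
        self.stack = []         # open elements as (tag, is_target_container)
        self.span_depth = None  # stack depth of the span currently being captured
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        parent_is_container = bool(self.stack) and self.stack[-1][1]
        is_container = dict(attrs).get("data-testid") == self.testid
        self.stack.append((tag, is_container))
        if tag == "span" and parent_is_container and self.span_depth is None:
            self.span_depth = len(self.stack)
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        if self.span_depth == len(self.stack):
            self.span_depth = None  # the captured span just closed
        self.stack.pop()

    def handle_data(self, data):
        if self.span_depth is not None:
            self.results[-1] += data


def span_texts(html, testid):
    """Return the stripped text of each matching span, in document order."""
    finder = SpanFinder(testid)
    finder.feed(html)
    return [text.strip() for text in finder.results]


# Hypothetical markup, shaped like the selector described above.
SAMPLE = (
    '<div data-testid="UserProfileHeader_Items">'
    '<span>Somewhere</span>'
    '<span><svg></svg>Joined August 2020</span>'
    '</div>'
    '<span>Unrelated</span>'
)

if __name__ == "__main__":
    print(span_texts(SAMPLE, "UserProfileHeader_Items"))
```

The join date would then be whichever returned string starts with "Joined ". Matching on that prefix is more robust than relying on the span's position.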

I don't understand why this is throwing '400 Bad Request' instead of 'Could not retreive artist joined date.', however. Either more than one thing is wrong, or the 'if not joined_items' check just isn't tripping for some reason.
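
Going by the traceback in the first comment, the exception is raised inside stateful_get (called from line 149 of get_joined_date), so execution never reaches the selector check. A minimal sketch of that ordering follows. The two exception classes are stand-ins for the real ones, and fake_fetch and which_error are hypothetical helpers that exist only for this illustration.

```python
class FetchFailureError(Exception):
    """Stand-in for WebRequest.Exceptions.FetchFailureError."""


class AccountDisabledException(Exception):
    """Stand-in for the scraper's exceptions.AccountDisabledException."""


def fake_fetch(url, status):
    # Hypothetical fetcher: raises on any non-200 status, as the log shows for the 400.
    if status != 200:
        raise FetchFailureError("%d for url: %s" % (status, url))
    return "<html></html>"


def get_joined_date(url, status):
    # Step 1: fetch the page. On a 400 the exception propagates from here.
    fake_fetch(url, status)
    # Step 2: only reached if the fetch succeeded. Pretend the selector matched nothing.
    joined_items = []
    if not joined_items:
        raise AccountDisabledException("Could not retrieve artist joined date.")
    return joined_items[0]


def which_error(status):
    """Return the name of the exception get_joined_date raises for this status."""
    try:
        get_joined_date("https://twitter.com/example", status)
    except Exception as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(which_error(400))  # the fetch fails first
    print(which_error(200))  # the fetch succeeds, the selector check fails
```

So the stale selector would only show up as the "joined date" error once the page fetch itself succeeds again.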

@fake-name
Owner

Dammit, I hate minified/obfuscated CSS.

@fake-name
Owner

fake-name commented Aug 12, 2020

The reason you're seeing the 400 error is probably because they added more UA/header sniffing, which is catching that WebRequest isn't acting exactly like a browser.

More and more, I'm considering trying to create a library around either the Firefox or Chromium HTTP(S) client code.

@Copy-link
Contributor Author

Copy-link commented Aug 12, 2020

It's not sniffing, as it turns out. It's now directly checking whether the JavaScript was loaded, and blocking you if it wasn't.

<script nonce="ZmY4Y2NjZGUtNjZkMi00ZTY4LWIyZWEtMWE0ZDM1YmE2MDg4">
  if (!window.__SCRIPTS_LOADED__['main']) {
    document.getElementById('ScriptLoadFailure').style.display = 'block';
  }
</script>

@Copy-link
Contributor Author

This does look like something I'd have to switch to your headless browser method for, though I have a feeling that even if I did, updates would be necessary, since the way to obtain the join date changed so drastically.

@Copy-link
Contributor Author

Temporary workaround.

	def get_joined_date(self, user, twit_headers):

		ctnt = self.stateful_get("https://nitter.net/{user}".format(user=user), headers=twit_headers)
		html = HTML(html=ctnt)
		joined_items = html.find(".profile-joindate > span > div")
		if not joined_items:
			raise exceptions.AccountDisabledException("Could not retreive artist joined date. "
				"This usually means the account has been disabled!")

		assert len(joined_items) == 1, "Too many joined items?"
		joined = joined_items[0]

		posttime = dateparser.parse(joined.text.replace("Joined ",""))

		self.log.info("User %s joined twitter at %s", user, posttime)

		return posttime
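
The text handling in the workaround can be sketched on its own with just the standard library. This assumes the element text looks like "Joined August 2020". That format is a guess at what the page returns, and parse_joined_text is a hypothetical helper, not part of xA-Scraper. Unlike replace(), it strips "Joined " only when it is a prefix.

```python
from datetime import datetime

PREFIX = "Joined "


def parse_joined_text(text):
    """Parse a string like 'Joined August 2020' into a datetime.

    Assumes an English month name followed by a four-digit year.
    The day is not present in the text, so it defaults to the 1st.
    """
    text = text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return datetime.strptime(text, "%B %Y")


if __name__ == "__main__":
    print(parse_joined_text("Joined August 2020"))
```

dateparser, as used in the workaround, is more forgiving about formats. The strict strptime version fails with a ValueError if the text ever changes shape, which makes that kind of breakage easier to spot.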

God bless nitter.net
